* nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
@ 2021-10-22 10:08 Li Chen
  2021-10-25  9:02 ` Li Chen
  2021-10-25 15:47 ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread
From: Li Chen @ 2021-10-22 10:08 UTC (permalink / raw)
  To: linux-pci; +Cc: Lorenzo Pieralisi, Rob Herring, kw, Bjorn Helgaas, linux-kernel

Hi, all

I found that my NVMe drive may time out with a simple dd. The PCIe controller is a Cadence one, the host driver I use is pcie-cadence-host.c, and the kernel version is 5.10.32:

# dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
[   41.913484][  T274] urandom_read: 2 callbacks suppressed
[   41.913490][  T274] random: dd: uninitialized urandom read (4096 bytes read)
[   41.926130][  T274] random: dd: uninitialized urandom read (4096 bytes read)
[   41.933348][  T274] random: dd: uninitialized urandom read (4096 bytes read)
[   47.651842][    C0] random: crng init done
[   47.655963][    C0] random: 2 urandom warning(s) missed due to ratelimiting
[   81.448128][   T64] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[   81.481139][    T7] nvme 0000:05:00.0: enabling bus mastering
[   81.486946][    T7] nvme 0000:05:00.0: saving config space at offset 0x0 (reading 0xa809144d)
[   81.495517][    T7] nvme 0000:05:00.0: saving config space at offset 0x4 (reading 0x20100006)
[   81.504091][    T7] nvme 0000:05:00.0: saving config space at offset 0x8 (reading 0x1080200)
[   81.512571][    T7] nvme 0000:05:00.0: saving config space at offset 0xc (reading 0x0)
[   81.520527][    T7] nvme 0000:05:00.0: saving config space at offset 0x10 (reading 0x8000004)
[   81.529094][    T7] nvme 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[   81.537138][    T7] nvme 0000:05:00.0: saving config space at offset 0x18 (reading 0x0)
[   81.545186][    T7] nvme 0000:05:00.0: saving config space at offset 0x1c (reading 0x0)
[   81.553252][    T7] nvme 0000:05:00.0: saving config space at offset 0x20 (reading 0x0)
[   81.561296][    T7] nvme 0000:05:00.0: saving config space at offset 0x24 (reading 0x0)
[   81.569340][    T7] nvme 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
[   81.577384][    T7] nvme 0000:05:00.0: saving config space at offset 0x2c (reading 0xa801144d)
[   81.586038][    T7] nvme 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
[   81.594081][    T7] nvme 0000:05:00.0: saving config space at offset 0x34 (reading 0x40)
[   81.602217][    T7] nvme 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
[   81.610266][    T7] nvme 0000:05:00.0: saving config space at offset 0x3c (reading 0x12c)
[   81.634065][    T7] nvme nvme0: Shutdown timeout set to 8 seconds
[   81.674332][    T7] nvme nvme0: 1/0/0 default/read/poll queues
[  112.168136][  T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
[  112.193145][  T256] blk_update_request: I/O error, dev nvme0n1, sector 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.205220][  T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost async page write
[  112.213978][  T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost async page write
[  112.222727][  T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost async page write
[  112.231474][  T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost async page write
[  112.240220][  T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost async page write
[  112.248966][  T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost async page write
[  112.257719][  T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost async page write
[  112.266467][  T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost async page write
[  112.275213][  T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost async page write
[  112.283959][  T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost async page write
[  112.293554][  T256] blk_update_request: I/O error, dev nvme0n1, sector 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.306559][  T256] blk_update_request: I/O error, dev nvme0n1, sector 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.319525][  T256] blk_update_request: I/O error, dev nvme0n1, sector 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.332501][  T256] blk_update_request: I/O error, dev nvme0n1, sector 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.345466][  T256] blk_update_request: I/O error, dev nvme0n1, sector 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.358427][  T256] blk_update_request: I/O error, dev nvme0n1, sector 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.371386][  T256] blk_update_request: I/O error, dev nvme0n1, sector 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.384346][  T256] blk_update_request: I/O error, dev nvme0n1, sector 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.397315][  T256] blk_update_request: I/O error, dev nvme0n1, sector 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[  112.459313][    T7] nvme nvme0: failed to mark controller live state
[  112.465758][    T7] nvme nvme0: Removing after probe failure status: -19
dd: error writing '/dev/nvme0n1': No space left on device
112200+0 records in
112199+0 records out
459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
# [  112.935768][    T7] nvme nvme0: failed to set APST feature (-19)




Here is the dts I used:
pciec: pcie-controller@2040000000 {
                                compatible = "cdns,cdns-pcie-host";
		device_type = "pci";
		#address-cells = <3>;
		#size-cells = <2>;
		bus-range = <0 5>;
		linux,pci-domain = <0>;
		cdns,no-bar-match-nbits = <38>;
		vendor-id = <0x17cd>;
		device-id = <0x0100>;
		reg-names = "reg", "cfg";
		reg = <0x20 0x40000000 0x0 0x10000000>,
		      <0x20 0x00000000 0x0 0x00001000>;	/* RC only */

		/*
		 * type: 0x00000000 cfg space
		 * type: 0x01000000 IO
		 * type: 0x02000000 32bit mem space No prefetch
		 * type: 0x03000000 64bit mem space No prefetch
		 * type: 0x43000000 64bit mem space prefetch
		 * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
		 * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
		*/
		ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
			 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;

		#interrupt-cells = <0x1>;
		interrupt-map-mask = <0x00 0x0 0x0 0x7>;
		interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
				<0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
				<0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
				<0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
		phys = <&pcie_phy>;
		phy-names="pcie-phy";
		status = "ok";
	};


After some digging, I found that if I change the controller's ranges property from <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> to <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>, then dd succeeds without a timeout. IIUC, this range is only for non-prefetchable 32-bit MMIO, but dd should be using DMA (maybe the CPU sends commands to the nvme controller via MMIO?).
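
(For reference, my understanding of how one entry of that ranges property decodes, using the usual 3-cell PCI address / 2-cell CPU address / 2-cell size layout; please correct me if I have this wrong:

  <0x02000000  0x00 0x08000000   0x20 0x08000000   0x00 0x08000000>
    flags      PCI (bus) addr    CPU addr          size
    32-bit     0x08000000        0x2008000000      0x08000000 (128 MB)
    non-prefetchable MEM

so the original entry maps CPU 0x2008000000..0x200fffffff to bus addresses
0x08000000..0x0fffffff, and the modified one maps CPU 0x2000400000..0x20083fffff
to bus addresses 0x00400000..0x083fffff.)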

Question:
1. Why can dd cause an nvme timeout? Are there more ways to debug this?
2. How can this MMIO range affect the nvme timeout?

Regards,
Li



* RE: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-22 10:08 nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges Li Chen
@ 2021-10-25  9:02 ` Li Chen
  2021-10-25 15:47 ` Bjorn Helgaas
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2021-10-25  9:02 UTC (permalink / raw)
  To: linux-pci; +Cc: Lorenzo Pieralisi, Rob Herring, kw, Bjorn Helgaas, linux-kernel

> [   81.634065][    T7] nvme nvme0: Shutdown timeout set to 8 seconds
> [   81.674332][    T7] nvme nvme0: 1/0/0 default/read/poll queues
> [  112.168136][  T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller

I/O 12 on QID 1 is an nvme_cmd_write. I reproduced the problem after rebooting, this time tracing it with ftrace:

# dmesg | grep QID
[   99.456260] nvme nvme0: I/O 128 QID 1 timeout, disable controller
# grep "qid=1" /sys/kernel/tracing/trace | egrep "cmdid=128\b"
    kworker/0:1H-65      [000] ....    68.987492: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=128, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=216856, len=1015, ctrl=0x0, dsmgmt=0, reftag=0)
          <idle>-0       [001] ..s1    99.475164: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=128, res=0x0, retries=1, flags=0x1, status=0x371
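
(If I am reading the timestamps right, nvme_setup_cmd fires at 68.99s and the matching nvme_complete_rq only at 99.48s, about 30 seconds later, which lines up with the nvme driver's default 30-second I/O timeout.)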

Regards,
Li



* Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-22 10:08 nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges Li Chen
  2021-10-25  9:02 ` Li Chen
@ 2021-10-25 15:47 ` Bjorn Helgaas
  2021-10-25 16:21   ` Keith Busch
  2021-10-26  4:33   ` Li Chen
  1 sibling, 2 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2021-10-25 15:47 UTC (permalink / raw)
  To: Li Chen
  Cc: linux-pci, Lorenzo Pieralisi, Rob Herring, kw, Bjorn Helgaas,
	linux-kernel, Tom Joseph, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

[+cc Tom (Cadence maintainer), NVMe folks]

On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> Hi, all
> 
> I found my nvme may get timeout with simply dd, the pcie controller
> is cadence, and the pcie-controller platform I use is
> pcie-cadence-host.c, kenrel version is 5.10.32:

Given that nvme is pretty widely used, I tend to suspect the cadence
driver or DTS here, but I added both cadence and nvme folks to the CC:
list.

> # dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
> [   41.913484][  T274] urandom_read: 2 callbacks suppressed
> [   41.913490][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> [   41.926130][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> [   41.933348][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> [   47.651842][    C0] random: crng init done
> [   47.655963][    C0] random: 2 urandom warning(s) missed due to ratelimiting
> [   81.448128][   T64] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
> [   81.481139][    T7] nvme 0000:05:00.0: enabling bus mastering
> [   81.486946][    T7] nvme 0000:05:00.0: saving config space at offset 0x0 (reading 0xa809144d)
> [   81.495517][    T7] nvme 0000:05:00.0: saving config space at offset 0x4 (reading 0x20100006)
> [   81.504091][    T7] nvme 0000:05:00.0: saving config space at offset 0x8 (reading 0x1080200)
> [   81.512571][    T7] nvme 0000:05:00.0: saving config space at offset 0xc (reading 0x0)
> [   81.520527][    T7] nvme 0000:05:00.0: saving config space at offset 0x10 (reading 0x8000004)
> [   81.529094][    T7] nvme 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
> [   81.537138][    T7] nvme 0000:05:00.0: saving config space at offset 0x18 (reading 0x0)
> [   81.545186][    T7] nvme 0000:05:00.0: saving config space at offset 0x1c (reading 0x0)
> [   81.553252][    T7] nvme 0000:05:00.0: saving config space at offset 0x20 (reading 0x0)
> [   81.561296][    T7] nvme 0000:05:00.0: saving config space at offset 0x24 (reading 0x0)
> [   81.569340][    T7] nvme 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
> [   81.577384][    T7] nvme 0000:05:00.0: saving config space at offset 0x2c (reading 0xa801144d)
> [   81.586038][    T7] nvme 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
> [   81.594081][    T7] nvme 0000:05:00.0: saving config space at offset 0x34 (reading 0x40)
> [   81.602217][    T7] nvme 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
> [   81.610266][    T7] nvme 0000:05:00.0: saving config space at offset 0x3c (reading 0x12c)
> [   81.634065][    T7] nvme nvme0: Shutdown timeout set to 8 seconds
> [   81.674332][    T7] nvme nvme0: 1/0/0 default/read/poll queues
> [  112.168136][  T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
> [  112.193145][  T256] blk_update_request: I/O error, dev nvme0n1, sector 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.205220][  T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost async page write
> [  112.213978][  T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost async page write
> [  112.222727][  T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost async page write
> [  112.231474][  T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost async page write
> [  112.240220][  T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost async page write
> [  112.248966][  T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost async page write
> [  112.257719][  T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost async page write
> [  112.266467][  T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost async page write
> [  112.275213][  T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost async page write
> [  112.283959][  T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost async page write
> [  112.293554][  T256] blk_update_request: I/O error, dev nvme0n1, sector 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.306559][  T256] blk_update_request: I/O error, dev nvme0n1, sector 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.319525][  T256] blk_update_request: I/O error, dev nvme0n1, sector 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.332501][  T256] blk_update_request: I/O error, dev nvme0n1, sector 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.345466][  T256] blk_update_request: I/O error, dev nvme0n1, sector 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.358427][  T256] blk_update_request: I/O error, dev nvme0n1, sector 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.371386][  T256] blk_update_request: I/O error, dev nvme0n1, sector 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.384346][  T256] blk_update_request: I/O error, dev nvme0n1, sector 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.397315][  T256] blk_update_request: I/O error, dev nvme0n1, sector 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [  112.459313][    T7] nvme nvme0: failed to mark controller live state
> [  112.465758][    T7] nvme nvme0: Removing after probe failure status: -19
> dd: error writing '/dev/nvme0n1': No space left on device
> 112200+0 records in
> 112199+0 records out
> 459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
> # [  112.935768][    T7] nvme nvme0: failed to set APST feature (-19)
> 
> 
> 
> 
> Here is the dts I used:
> pciec: pcie-controller@2040000000 {
>                                 compatible = "cdns,cdns-pcie-host";
> 		device_type = "pci";
> 		#address-cells = <3>;
> 		#size-cells = <2>;
> 		bus-range = <0 5>;
> 		linux,pci-domain = <0>;
> 		cdns,no-bar-match-nbits = <38>;
> 		vendor-id = <0x17cd>;
> 		device-id = <0x0100>;
> 		reg-names = "reg", "cfg";
> 		reg = <0x20 0x40000000 0x0 0x10000000>,
> 		      <0x20 0x00000000 0x0 0x00001000>;	/* RC only */
> 
> 		/*
> 		 * type: 0x00000000 cfg space
> 		 * type: 0x01000000 IO
> 		 * type: 0x02000000 32bit mem space No prefetch
> 		 * type: 0x03000000 64bit mem space No prefetch
> 		 * type: 0x43000000 64bit mem space prefetch
> 		 * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> 		 * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
> 		*/
> 		ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
> 			 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;
> 
> 		#interrupt-cells = <0x1>;
> 		interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> 		interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> 				<0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> 				<0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> 				<0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> 		phys = <&pcie_phy>;
> 		phy-names="pcie-phy";
> 		status = "ok";
> 	};
> 
> 
> After some digging, I find if I change the controller's range
> property from
>
> <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
>
> then dd will success without timeout. IIUC, range here
> is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> will send cmd to nvme controller via mmio?).

I don't know how to interpret "ranges".  Can you supply the dmesg and
"lspci -vvs 0000:05:00.0" output both ways, e.g., 

  pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
  pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
  pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
  pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]

> Question:
> 1.  Why dd can cause nvme timeout? Is there more debug ways?
> 2. How can this mmio range affect nvme timeout?
> 
> Regards,
> Li
> 


* Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-25 15:47 ` Bjorn Helgaas
@ 2021-10-25 16:21   ` Keith Busch
  2021-10-26  3:40     ` [EXT] " Li Chen
  2021-10-26  4:33   ` Li Chen
  1 sibling, 1 reply; 14+ messages in thread
From: Keith Busch @ 2021-10-25 16:21 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Li Chen, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

On Mon, Oct 25, 2021 at 10:47:39AM -0500, Bjorn Helgaas wrote:
> [+cc Tom (Cadence maintainer), NVMe folks]
> 
> On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > pciec: pcie-controller@2040000000 {
> >                                 compatible = "cdns,cdns-pcie-host";
> > 		device_type = "pci";
> > 		#address-cells = <3>;
> > 		#size-cells = <2>;
> > 		bus-range = <0 5>;
> > 		linux,pci-domain = <0>;
> > 		cdns,no-bar-match-nbits = <38>;
> > 		vendor-id = <0x17cd>;
> > 		device-id = <0x0100>;
> > 		reg-names = "reg", "cfg";
> > 		reg = <0x20 0x40000000 0x0 0x10000000>,
> > 		      <0x20 0x00000000 0x0 0x00001000>;	/* RC only */
> > 
> > 		/*
> > 		 * type: 0x00000000 cfg space
> > 		 * type: 0x01000000 IO
> > 		 * type: 0x02000000 32bit mem space No prefetch
> > 		 * type: 0x03000000 64bit mem space No prefetch
> > 		 * type: 0x43000000 64bit mem space prefetch
> > 		 * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > 		 * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
> > 		*/
> > 		ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
> > 			 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;
> > 
> > 		#interrupt-cells = <0x1>;
> > 		interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > 		interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > 				<0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > 				<0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > 				<0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > 		phys = <&pcie_phy>;
> > 		phy-names="pcie-phy";
> > 		status = "ok";
> > 	};
> > 
> > 
> > After some digging, I find if I change the controller's range
> > property from
> >
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> >
> > then dd will success without timeout. IIUC, range here
> > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > will send cmd to nvme controller via mmio?).

Generally speaking, an nvme driver notifies the controller of new
commands via a MMIO write to a specific nvme register. The nvme
controller fetches those commands from host memory with a DMA.

One exception to that description is if the nvme controller supports CMB
with SQEs, but they're not very common. If you had such a controller,
the driver will use MMIO to write commands directly into controller
memory instead of letting the controller DMA them from host memory. Do
you know if you have such a controller?

The data transfers associated with your 'dd' command will always use DMA.
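
To make that concrete, here is a rough, self-contained sketch of that
submission path. The names are made up and this is not the actual
drivers/nvme/host/pci.c code, just the shape of it:

#include <stdint.h>

struct nvme_sqe { uint8_t bytes[64]; };   /* one 64-byte submission queue entry */

struct sketch_queue {
	struct nvme_sqe *sq_cmds;         /* SQ ring in host DRAM, fetched by the device via DMA */
	volatile uint32_t *doorbell;      /* register in BAR0, reached through the "ranges" window */
	uint16_t tail, depth;
};

static void sketch_submit(struct sketch_queue *q, const struct nvme_sqe *cmd)
{
	q->sq_cmds[q->tail] = *cmd;       /* plain store into host memory */
	q->tail = (q->tail + 1) % q->depth;
	*q->doorbell = q->tail;           /* the only MMIO access on this path; the
	                                   * controller then DMA-reads the SQE and,
	                                   * for a write, the data buffers as well */
}

With a CMB, sq_cmds would instead point into controller memory (BAR space),
so that first store would become MMIO as well.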

> I don't know how to interpret "ranges".  Can you supply the dmesg and
> "lspci -vvs 0000:05:00.0" output both ways, e.g., 
> 
>   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
>   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
>   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
>   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> 
> > Question:
> > 1.  Why dd can cause nvme timeout? Is there more debug ways?

That means the nvme controller didn't provide a response to a posted
command within the driver's latency tolerance.

> > 2. How can this mmio range affect nvme timeout?

Let's see how those ranges affect what the kernel sees in the pci
topology, as Bjorn suggested.


* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-25 16:21   ` Keith Busch
@ 2021-10-26  3:40     ` Li Chen
  2021-10-26  4:15       ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2021-10-26  3:40 UTC (permalink / raw)
  To: Keith Busch, Bjorn Helgaas
  Cc: linux-pci, Lorenzo Pieralisi, Rob Herring, kw, Bjorn Helgaas,
	linux-kernel, Tom Joseph, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

Hi, Keith and Bjorn

> -----Original Message-----
> From: Keith Busch [mailto:kbusch@kernel.org]
> Sent: Tuesday, October 26, 2021 12:22 AM
> To: Bjorn Helgaas
> Cc: Li Chen; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> On Mon, Oct 25, 2021 at 10:47:39AM -0500, Bjorn Helgaas wrote:
> > [+cc Tom (Cadence maintainer), NVMe folks]
> >
> > On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > > pciec: pcie-controller@2040000000 {
> > >                                 compatible = "cdns,cdns-pcie-host";
> > > 		device_type = "pci";
> > > 		#address-cells = <3>;
> > > 		#size-cells = <2>;
> > > 		bus-range = <0 5>;
> > > 		linux,pci-domain = <0>;
> > > 		cdns,no-bar-match-nbits = <38>;
> > > 		vendor-id = <0x17cd>;
> > > 		device-id = <0x0100>;
> > > 		reg-names = "reg", "cfg";
> > > 		reg = <0x20 0x40000000 0x0 0x10000000>,
> > > 		      <0x20 0x00000000 0x0 0x00001000>;	/* RC only */
> > >
> > > 		/*
> > > 		 * type: 0x00000000 cfg space
> > > 		 * type: 0x01000000 IO
> > > 		 * type: 0x02000000 32bit mem space No prefetch
> > > 		 * type: 0x03000000 64bit mem space No prefetch
> > > 		 * type: 0x43000000 64bit mem space prefetch
> > > 		 * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > > 		 * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00
> 0x01000000>, CFG_SPACE
> > > 		*/
> > > 		ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00
> 0x00100000>,
> > > 			 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00
> 0x08000000>;
> > >
> > > 		#interrupt-cells = <0x1>;
> > > 		interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > > 		interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > > 				<0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > > 				<0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > > 				<0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > > 		phys = <&pcie_phy>;
> > > 		phy-names="pcie-phy";
> > > 		status = "ok";
> > > 	};
> > >
> > >
> > > After some digging, I find if I change the controller's range
> > > property from
> > >
> > > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> > >
> > > then dd will success without timeout. IIUC, range here
> > > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > > will send cmd to nvme controller via mmio?).
> 
> Generally speaking, an nvme driver notifies the controller of new
> commands via a MMIO write to a specific nvme register. The nvme
> controller fetches those commands from host memory with a DMA.
> 
> One exception to that description is if the nvme controller supports CMB
> with SQEs, but they're not very common. If you had such a controller,
> the driver will use MMIO to write commands directly into controller
> memory instead of letting the controller DMA them from host memory. Do
> you know if you have such a controller?
> 
> The data transfers associated with your 'dd' command will always use DMA.
> 

My nvme is "05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980". Its datasheet, https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_Rev.1.1.pdf, says nothing about CMB/SQEs, so I'm not sure. Are there other ways/tools (like nvme-cli) to query this?

> > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> >
> >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> >
> > > Question:
> > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> 
> That means the nvme controller didn't provide a response to a posted
> command within the driver's latency tolerance.

FYI, with the help of the PCIe bridge vendor, they found something interesting: "From catc log, I saw some memory read pkts sent from SSD card, but its memory range is within the memory range of switch down port. So, switch down port will replay UR pkt. It seems not normal." and "Why SSD card send out some memory pkts which memory address is within switch down port's memory range. If so, switch will response UR pkts". I also don't understand how this can happen.


> 
> > > 2. How can this mmio range affect nvme timeout?
> 
> Let's see how those ranges affect what the kernel sees in the pci
> topology, as Bjorn suggested.

OK, I will add details in another mail replying to Bjorn.


Regards,
Li



* Re: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-26  3:40     ` [EXT] " Li Chen
@ 2021-10-26  4:15       ` Keith Busch
  2021-10-26  4:44         ` Li Chen
  2021-10-29 10:52         ` Li Chen
  0 siblings, 2 replies; 14+ messages in thread
From: Keith Busch @ 2021-10-26  4:15 UTC (permalink / raw)
  To: Li Chen
  Cc: Bjorn Helgaas, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980". From its datasheet, https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_Rev.1.1.pdf, it says nothing about CMB/SQEs, so I'm not sure. Is there other ways/tools(like nvme-cli) to query?

The driver will export a sysfs property for it if it is supported:

  # cat /sys/class/nvme/nvme0/cmb

If the file doesn't exist, then /dev/nvme0 doesn't have the capability.

> > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > >
> > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > >
> > > > Question:
> > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > 
> > That means the nvme controller didn't provide a response to a posted
> > command within the driver's latency tolerance.
> 
> FYI, with the help of pci bridger's vendor, they find something interesting: "From catc log, I saw some memory read pkts sent from SSD card, but its memory range is within the memory range of switch down port. So, switch down port will replay UR pkt. It seems not normal." and "Why SSD card send out some memory pkts which memory address is within switch down port's memory range. If so, switch will response UR pkts". I also don't understand how can this happen?

I think we can safely assume you're not attempting peer-to-peer, so that
behavior as described shouldn't be happening. It sounds like the memory
windows may be incorrect. The dmesg may help to show if something appears
wrong.


* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-25 15:47 ` Bjorn Helgaas
  2021-10-25 16:21   ` Keith Busch
@ 2021-10-26  4:33   ` Li Chen
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2021-10-26  4:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Lorenzo Pieralisi, Rob Herring, kw, Bjorn Helgaas,
	linux-kernel, Tom Joseph, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi, Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Monday, October 25, 2021 11:48 PM
> To: Li Chen
> Cc: linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring; kw@linux.com;
> Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Keith Busch; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> [+cc Tom (Cadence maintainer), NVMe folks]
> 
> On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > Hi, all
> >
> > I found my nvme may get timeout with simply dd, the pcie controller
> > is cadence, and the pcie-controller platform I use is
> > pcie-cadence-host.c, kenrel version is 5.10.32:
> 
> Given that nvme is pretty widely used, I tend to suspect the cadence
> driver or DTS here, but I added both cadence and nvme folks to the CC:
> list.

Thanks!

> 
> > # dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
> > [   41.913484][  T274] urandom_read: 2 callbacks suppressed
> > [   41.913490][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [   41.926130][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [   41.933348][  T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [   47.651842][    C0] random: crng init done
> > [   47.655963][    C0] random: 2 urandom warning(s) missed due to ratelimiting
> > [   81.448128][   T64] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x2010
> > [   81.481139][    T7] nvme 0000:05:00.0: enabling bus mastering
> > [   81.486946][    T7] nvme 0000:05:00.0: saving config space at offset 0x0
> (reading 0xa809144d)
> > [   81.495517][    T7] nvme 0000:05:00.0: saving config space at offset 0x4
> (reading 0x20100006)
> > [   81.504091][    T7] nvme 0000:05:00.0: saving config space at offset 0x8
> (reading 0x1080200)
> > [   81.512571][    T7] nvme 0000:05:00.0: saving config space at offset 0xc
> (reading 0x0)
> > [   81.520527][    T7] nvme 0000:05:00.0: saving config space at offset 0x10
> (reading 0x8000004)
> > [   81.529094][    T7] nvme 0000:05:00.0: saving config space at offset 0x14
> (reading 0x0)
> > [   81.537138][    T7] nvme 0000:05:00.0: saving config space at offset 0x18
> (reading 0x0)
> > [   81.545186][    T7] nvme 0000:05:00.0: saving config space at offset 0x1c
> (reading 0x0)
> > [   81.553252][    T7] nvme 0000:05:00.0: saving config space at offset 0x20
> (reading 0x0)
> > [   81.561296][    T7] nvme 0000:05:00.0: saving config space at offset 0x24
> (reading 0x0)
> > [   81.569340][    T7] nvme 0000:05:00.0: saving config space at offset 0x28
> (reading 0x0)
> > [   81.577384][    T7] nvme 0000:05:00.0: saving config space at offset 0x2c
> (reading 0xa801144d)
> > [   81.586038][    T7] nvme 0000:05:00.0: saving config space at offset 0x30
> (reading 0x0)
> > [   81.594081][    T7] nvme 0000:05:00.0: saving config space at offset 0x34
> (reading 0x40)
> > [   81.602217][    T7] nvme 0000:05:00.0: saving config space at offset 0x38
> (reading 0x0)
> > [   81.610266][    T7] nvme 0000:05:00.0: saving config space at offset 0x3c
> (reading 0x12c)
> > [   81.634065][    T7] nvme nvme0: Shutdown timeout set to 8 seconds
> > [   81.674332][    T7] nvme nvme0: 1/0/0 default/read/poll queues
> > [  112.168136][  T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
> > [  112.193145][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.205220][  T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost
> async page write
> > [  112.213978][  T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost
> async page write
> > [  112.222727][  T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost
> async page write
> > [  112.231474][  T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost
> async page write
> > [  112.240220][  T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost
> async page write
> > [  112.248966][  T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost
> async page write
> > [  112.257719][  T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost
> async page write
> > [  112.266467][  T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost
> async page write
> > [  112.275213][  T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost
> async page write
> > [  112.283959][  T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost
> async page write
> > [  112.293554][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.306559][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.319525][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.332501][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.345466][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.358427][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.371386][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.384346][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.397315][  T256] blk_update_request: I/O error, dev nvme0n1, sector
> 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [  112.459313][    T7] nvme nvme0: failed to mark controller live state
> > [  112.465758][    T7] nvme nvme0: Removing after probe failure status: -19
> > dd: error writing '/dev/nvme0n1': No space left on device
> > 112200+0 records in
> > 112199+0 records out
> > 459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
> > # [  112.935768][    T7] nvme nvme0: failed to set APST feature (-19)
> >
> >
> >
> >
> > Here is the dts I used:
> > pciec: pcie-controller@2040000000 {
> >                                 compatible = "cdns,cdns-pcie-host";
> > 		device_type = "pci";
> > 		#address-cells = <3>;
> > 		#size-cells = <2>;
> > 		bus-range = <0 5>;
> > 		linux,pci-domain = <0>;
> > 		cdns,no-bar-match-nbits = <38>;
> > 		vendor-id = <0x17cd>;
> > 		device-id = <0x0100>;
> > 		reg-names = "reg", "cfg";
> > 		reg = <0x20 0x40000000 0x0 0x10000000>,
> > 		      <0x20 0x00000000 0x0 0x00001000>;	/* RC only */
> >
> > 		/*
> > 		 * type: 0x00000000 cfg space
> > 		 * type: 0x01000000 IO
> > 		 * type: 0x02000000 32bit mem space No prefetch
> > 		 * type: 0x03000000 64bit mem space No prefetch
> > 		 * type: 0x43000000 64bit mem space prefetch
> > 		 * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > 		 * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00
> 0x01000000>, CFG_SPACE
> > 		*/
> > 		ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00
> 0x00100000>,
> > 			 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00
> 0x08000000>;
> >
> > 		#interrupt-cells = <0x1>;
> > 		interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > 		interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > 				<0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > 				<0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > 				<0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > 		phys = <&pcie_phy>;
> > 		phy-names="pcie-phy";
> > 		status = "ok";
> > 	};
> >
> >
> > After some digging, I find if I change the controller's range
> > property from
> >
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> >
> > then dd will success without timeout. IIUC, range here
> > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > will send cmd to nvme controller via mmio?).
> 
> I don't know how to interpret "ranges".  Can you supply the dmesg and
> "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> 
>   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
>   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
>   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
>   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> 

From the source code:

static int devm_of_pci_get_host_bridge_resources(...)
{
	...
	err = of_pci_range_parser_init(&parser, dev_node);
	...
	for_each_of_pci_range(&parser, &range) {
		/* Read next ranges element */
		if ((range.flags & IORESOURCE_TYPE_BITS) == IORESOURCE_IO)
			range_type = "IO";
		else if ((range.flags & IORESOURCE_TYPE_BITS) == IORESOURCE_MEM)
			range_type = "MEM";
		else
			range_type = "err";
		dev_info(dev, "  %6s %#012llx..%#012llx -> %#012llx\n",
			 range_type, range.cpu_addr,
			 range.cpu_addr + range.size - 1, range.pci_addr);

		/*
		 * If we failed translation or got a zero-sized region
		 * then skip this range
		 */
		if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
			continue;

		err = of_pci_range_to_resource(&range, dev_node, &tmp_res);
		...
}

So it simply parses the ranges property and turns each entry into a host bridge window resource, with the CPU-to-bus address translation taken directly from the DT.
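
In other words, with the first (failing) ranges below, the MEM window maps CPU 0x2008000000..0x200bffffff to bus addresses 0x08000000..0x0bffffff, while with the second (working) ranges it maps CPU 0x2000400000..0x20083fffff to bus addresses 0x00400000..0x083fffff, which is what the dmesg and /proc/iomem output below show.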



Take
ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
         <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
for example (this ranges value makes nvme time out):

lspci output:
# lspci -i /usr/share/misc/pci.ids -vvs 0000:05:00.0
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 44
        Region 0: Memory at 2008000000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b0] MSI-X: Enable- Count=13 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Power Budgeting <?>
        Capabilities: [168 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [188 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [190 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

	


/proc/iomem:

01200000-ffffffff : System RAM
  01280000-022affff : Kernel code
  022b0000-0295ffff : reserved
  02960000-040cffff : Kernel data
  05280000-0528ffff : reserved
  41cc0000-422c0fff : reserved
  422c1000-4232afff : reserved
  4232d000-667bbfff : reserved
  667bc000-667bcfff : reserved
  667bd000-667c0fff : reserved
  667c1000-ffffffff : reserved
2000000000-2000000fff : cfg
2008000000-200bffffff : pcie-controller@2040000000
  2008000000-20081fffff : PCI Bus 0000:01
    2008000000-20080fffff : PCI Bus 0000:02
      2008000000-20080fffff : PCI Bus 0000:05
        2008000000-2008003fff : 0000:05:00.0
          2008000000-2008003fff : nvme
    2008100000-200817ffff : 0000:01:00.0
...

From dmesg output:

# dmesg | grep pci
[    0.000000] Kernel command line: console=ttyS0 ubi.mtd=lnx root=ubi0:rootfs rw rootfstype=ubifs init=/linuxrc pci=nomsi pcie_pme=nomsi nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=0
[    2.431732] cdns-pcie 2040000000.pcie-controller: host bridge /pcie-controller@2040000000 ranges:
[    2.442055] cdns-pcie 2040000000.pcie-controller: Parsing ranges property...
[    2.450010] cdns-pcie 2040000000.pcie-controller:       IO 0x2000100000..0x20001fffff -> 0x0000000000
[    2.460215] cdns-pcie 2040000000.pcie-controller:      MEM 0x2008000000..0x200bffffff -> 0x0008000000
[    2.523271] cdns-pcie 2040000000.pcie-controller: Link up
[    2.530647] cdns-pcie 2040000000.pcie-controller: PCI host bridge to bus 0000:00
[    2.538885] pci_bus 0000:00: root bus resource [bus 00-05]
[    2.545194] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
[    2.552280] pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus address [0x08000000-0x0bffffff])
[    2.563651] pci_bus 0000:00: scanning bus
[    2.568753] pci 0000:00:00.0: [17cd:0100] type 01 class 0x060400
[    2.576196] pci 0000:00:00.0: supports D1
[    2.581006] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    2.587667] pci 0000:00:00.0: PME# disabled
[    2.597799] pci_bus 0000:00: fixups for bus
[    2.602800] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[    2.610308] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.619160] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[    2.627525] pci_bus 0000:01: scanning bus
[    2.632550] pci 0000:01:00.0: [12d8:c016] type 01 class 0x060400
[    2.639425] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0007ffff]
[    2.647250] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    2.654165] pci 0000:01:00.0: PME# disabled
[    2.664011] pci_bus 0000:01: fixups for bus
[    2.669012] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
[    2.676521] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.685370] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
[    2.693768] pci_bus 0000:02: scanning bus
[    2.699101] pci 0000:02:02.0: [12d8:c016] type 01 class 0x060400
[    2.706740] pci 0000:02:02.0: PME# supported from D0 D3hot D3cold
[    2.713653] pci 0000:02:02.0: PME# disabled
[    2.720417] pci 0000:02:04.0: [12d8:c016] type 01 class 0x060400
[    2.728032] pci 0000:02:04.0: PME# supported from D0 D3hot D3cold
[    2.734945] pci 0000:02:04.0: PME# disabled
[    2.741684] pci 0000:02:06.0: [12d8:c016] type 01 class 0x060400
[    2.749288] pci 0000:02:06.0: PME# supported from D0 D3hot D3cold
[    2.756201] pci 0000:02:06.0: PME# disabled
[    2.765483] pci_bus 0000:02: fixups for bus
[    2.770482] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 0
[    2.777991] pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.786839] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 0
[    2.794337] pci 0000:02:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.803184] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 0
[    2.810691] pci 0000:02:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.819539] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 1
[    2.827948] pci_bus 0000:03: scanning bus
[    2.836100] pci_bus 0000:03: fixups for bus
[    2.841085] pci_bus 0000:03: bus scan returning with max=03
[    2.847467] pci_bus 0000:03: busn_res: [bus 03-05] end is updated to 03
[    2.854911] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 1
[    2.863151] pci_bus 0000:04: scanning bus
[    2.871319] pci_bus 0000:04: fixups for bus
[    2.876303] pci_bus 0000:04: bus scan returning with max=04
[    2.882683] pci_bus 0000:04: busn_res: [bus 04-05] end is updated to 04
[    2.890151] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 1
[    2.898383] pci_bus 0000:05: scanning bus
[    2.903362] pci 0000:05:00.0: [144d:a809] type 00 class 0x010802
[    2.910266] pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    2.923188] pci_bus 0000:05: fixups for bus
[    2.928174] pci_bus 0000:05: bus scan returning with max=05
[    2.934553] pci_bus 0000:05: busn_res: [bus 05] end is updated to 05
[    2.941724] pci_bus 0000:02: bus scan returning with max=05
[    2.948101] pci_bus 0000:02: busn_res: [bus 02-05] end is updated to 05
[    2.955557] pci_bus 0000:01: bus scan returning with max=05
[    2.961983] pci_bus 0000:01: busn_res: [bus 01-05] end is updated to 05
[    2.969413] pci_bus 0000:00: bus scan returning with max=05
[    2.975843] pci 0000:00:00.0: BAR 8: assigned [mem 0x2008000000-0x20081fffff]
[    2.983843] pci 0000:01:00.0: BAR 8: assigned [mem 0x2008000000-0x20080fffff]
[    2.991795] pci 0000:01:00.0: BAR 0: assigned [mem 0x2008100000-0x200817ffff]
[    2.999801] pci 0000:02:06.0: BAR 8: assigned [mem 0x2008000000-0x20080fffff]
[    3.007763] pci 0000:02:02.0: PCI bridge to [bus 03]
[    3.013622] pci 0000:02:04.0: PCI bridge to [bus 04]
[    3.019487] pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
[    3.028012] pci 0000:02:06.0: PCI bridge to [bus 05]
[    3.033792] pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
[    3.041767] pci 0000:01:00.0: PCI bridge to [bus 02-05]
[    3.047808] pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
[    3.055782] pci 0000:00:00.0: PCI bridge to [bus 01-05]
[    3.061822] pci 0000:00:00.0:   bridge window [mem 0x2008000000-0x20081fffff]
[    4.184289]     pcie_pme=nomsi


Also take
ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
         <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
for example (this ranges value does not make nvme time out):

lspci:
# lspci -i /usr/share/misc/pci.ids -vvs 0000:05:00.0
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 44
        Region 0: Memory at 2000400000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b0] MSI-X: Enable- Count=13 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Power Budgeting <?>
        Capabilities: [168 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [188 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [190 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

/proc/iomem:
01200000-ffffffff : System RAM
  01280000-022affff : Kernel code
  022b0000-0295ffff : reserved
  02960000-040cffff : Kernel data
  05280000-0528ffff : reserved
  41cc0000-422c0fff : reserved
  422c1000-4232afff : reserved
  4232d000-667bbfff : reserved
  667bc000-667bcfff : reserved
  667bd000-667c0fff : reserved
  667c1000-ffffffff : reserved
2000000000-2000000fff : cfg
2000400000-20083fffff : pcie-controller@2040000000
  2000400000-20005fffff : PCI Bus 0000:01
    2000400000-20004fffff : PCI Bus 0000:02
      2000400000-20004fffff : PCI Bus 0000:05
        2000400000-2000403fff : 0000:05:00.0
    2000500000-200057ffff : 0000:01:00.0
...

Dmesg:
# dmesg | grep pci
[    0.000000] Kernel command line: console=ttyS0 ubi.mtd=lnx root=ubi0:rootfs rw rootfstype=ubifs init=/linuxrc pci=nomsi pcie_pme=nomsi nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=0
[    2.436357] cdns-pcie 2040000000.pcie-controller: host bridge /pcie-controller@2040000000 ranges:
[    2.446129] cdns-pcie 2040000000.pcie-controller: Parsing ranges property...
[    2.454141] cdns-pcie 2040000000.pcie-controller:       IO 0x2000100000..0x20001fffff -> 0x0000000000
[    2.464350] cdns-pcie 2040000000.pcie-controller:      MEM 0x2000400000..0x20083fffff -> 0x0000400000
[    2.527395] cdns-pcie 2040000000.pcie-controller: Link up
[    2.534745] cdns-pcie 2040000000.pcie-controller: PCI host bridge to bus 0000:00
[    2.543036] pci_bus 0000:00: root bus resource [bus 00-05]
[    2.549345] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff]
[    2.556431] pci_bus 0000:00: root bus resource [mem 0x2000400000-0x20083fffff] (bus address [0x00400000-0x083fffff])
[    2.567802] pci_bus 0000:00: scanning bus
[    2.572823] pci 0000:00:00.0: [17cd:0100] type 01 class 0x060400
[    2.580290] pci 0000:00:00.0: supports D1
[    2.585099] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    2.591709] pci 0000:00:00.0: PME# disabled
[    2.601802] pci_bus 0000:00: fixups for bus
[    2.606801] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[    2.614310] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.623161] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[    2.631551] pci_bus 0000:01: scanning bus
[    2.636563] pci 0000:01:00.0: [12d8:c016] type 01 class 0x060400
[    2.643460] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0007ffff]
[    2.651289] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    2.658202] pci 0000:01:00.0: PME# disabled
[    2.667966] pci_bus 0000:01: fixups for bus
[    2.673047] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
[    2.680556] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.689405] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
[    2.697804] pci_bus 0000:02: scanning bus
[    2.703079] pci 0000:02:02.0: [12d8:c016] type 01 class 0x060400
[    2.710742] pci 0000:02:02.0: PME# supported from D0 D3hot D3cold
[    2.717657] pci 0000:02:02.0: PME# disabled
[    2.724466] pci 0000:02:04.0: [12d8:c016] type 01 class 0x060400
[    2.732078] pci 0000:02:04.0: PME# supported from D0 D3hot D3cold
[    2.739037] pci 0000:02:04.0: PME# disabled
[    2.745734] pci 0000:02:06.0: [12d8:c016] type 01 class 0x060400
[    2.753344] pci 0000:02:06.0: PME# supported from D0 D3hot D3cold
[    2.760257] pci 0000:02:06.0: PME# disabled
[    2.769510] pci_bus 0000:02: fixups for bus
[    2.774532] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 0
[    2.782043] pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.790892] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 0
[    2.798494] pci 0000:02:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.807347] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 0
[    2.814855] pci 0000:02:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    2.823701] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 1
[    2.832098] pci_bus 0000:03: scanning bus
[    2.840343] pci_bus 0000:03: fixups for bus
[    2.845328] pci_bus 0000:03: bus scan returning with max=03
[    2.851708] pci_bus 0000:03: busn_res: [bus 03-05] end is updated to 03
[    2.859152] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 1
[    2.867404] pci_bus 0000:04: scanning bus
[    2.875551] pci_bus 0000:04: fixups for bus
[    2.880535] pci_bus 0000:04: bus scan returning with max=04
[    2.886914] pci_bus 0000:04: busn_res: [bus 04-05] end is updated to 04
[    2.894359] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 1
[    2.902637] pci_bus 0000:05: scanning bus
[    2.907632] pci 0000:05:00.0: [144d:a809] type 00 class 0x010802
[    2.914536] pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    2.927440] pci_bus 0000:05: fixups for bus
[    2.932426] pci_bus 0000:05: bus scan returning with max=05
[    2.938807] pci_bus 0000:05: busn_res: [bus 05] end is updated to 05
[    2.946025] pci_bus 0000:02: bus scan returning with max=05
[    2.952403] pci_bus 0000:02: busn_res: [bus 02-05] end is updated to 05
[    2.959834] pci_bus 0000:01: bus scan returning with max=05
[    2.966233] pci_bus 0000:01: busn_res: [bus 01-05] end is updated to 05
[    2.973666] pci_bus 0000:00: bus scan returning with max=05
[    2.980097] pci 0000:00:00.0: BAR 8: assigned [mem 0x2000400000-0x20005fffff]
[    2.988097] pci 0000:01:00.0: BAR 8: assigned [mem 0x2000400000-0x20004fffff]
[    2.996046] pci 0000:01:00.0: BAR 0: assigned [mem 0x2000500000-0x200057ffff]
[    3.004092] pci 0000:02:06.0: BAR 8: assigned [mem 0x2000400000-0x20004fffff]
[    3.012055] pci 0000:02:02.0: PCI bridge to [bus 03]
[    3.017867] pci 0000:02:04.0: PCI bridge to [bus 04]
[    3.023709] pci 0000:05:00.0: BAR 0: assigned [mem 0x2000400000-0x2000403fff 64bit]
[    3.032253] pci 0000:02:06.0: PCI bridge to [bus 05]
[    3.038071] pci 0000:02:06.0:   bridge window [mem 0x2000400000-0x20004fffff]
[    3.046086] pci 0000:01:00.0: PCI bridge to [bus 02-05]
[    3.052127] pci 0000:01:00.0:   bridge window [mem 0x2000400000-0x20004fffff]
[    3.060100] pci 0000:00:00.0: PCI bridge to [bus 01-05]
[    3.066140] pci 0000:00:00.0:   bridge window [mem 0x2000400000-0x20005fffff]
[    4.181033]     pcie_pme=nomsi
[   44.419264] nvme nvme0: pci function 0000:05:00.0
[   44.425541] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   44.432217] pci 0000:00:00.0: enabling bus mastering
[   44.438183] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   44.444720] pci 0000:01:00.0: enabling bus mastering
[   44.453373] pci 0000:02:06.0: enabling device (0000 -> 0002)
[   44.459951] pci 0000:02:06.0: enabling bus mastering


[firstlove@sh-lchen tmp]$ diff no-timeout.lspci timeout.lspci 
8c8
<         Region 0: Memory at 2000400000 (64-bit, non-prefetchable) [size=16K]
---
>         Region 0: Memory at 2008000000 (64-bit, non-prefetchable) [size=16K]

> > Question:
> > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > 2. How can this mmio range affect nvme timeout?
> >
> > Regards,
> > Li
> >
> >

Regards,
Li

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-26  4:15       ` Keith Busch
@ 2021-10-26  4:44         ` Li Chen
  2021-10-29 10:52         ` Li Chen
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2021-10-26  4:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi, Keith

> -----Original Message-----
> From: Keith Busch [mailto:kbusch@kernel.org]
> Sent: Tuesday, October 26, 2021 12:16 PM
> To: Li Chen
> Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> Ltd NVMe SSD Controller 980". From its datasheet,
> https://urldefense.com/v3/__https://s3.ap-northeast-
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> Is there other ways/tools(like nvme-cli) to query?
> 
> The driver will export a sysfs property for it if it is supported:
> 
>   # cat /sys/class/nvme/nvme0/cmb
> 
> If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> 

# ls /sys/class/nvme/nvme0/
address            model              rescan_controller  subsysnqn
cntlid             numa_node          reset_controller   subsystem
dev                nvme0n1            serial             transport
device             power              sqsize             uevent
firmware_rev       queue_count        state

My NVMe drive doesn't have a CMB.


> > > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > >
> > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > >
> > > > > Question:
> > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > >
> > > That means the nvme controller didn't provide a response to a posted
> > > command within the driver's latency tolerance.
> >
> > FYI, with the help of pci bridger's vendor, they find something interesting:
> "From catc log, I saw some memory read pkts sent from SSD card, but its memory
> range is within the memory range of switch down port. So, switch down port will
> replay UR pkt. It seems not normal." and "Why SSD card send out some memory
> pkts which memory address is within switch down port's memory range. If so,
> switch will response UR pkts". I also don't understand how can this happen?
> 
> I think we can safely assume you're not attempting peer-to-peer, so that
> behavior as described shouldn't be happening. It sounds like the memory
> windows may be incorrect. The dmesg may help to show if something appears
> wrong.

Dmesg has been pasted in https://marc.info/?l=linux-pci&m=163522281024680&w=3

Yes, peer-to-peer cannot happen; the NVMe drive is my only endpoint:

# lspci -i /usr/share/misc/pci.ids
00:00.0 PCI bridge: Cadence Design Systems, Inc. Device 0100
01:00.0 PCI bridge: Pericom Semiconductor Device c016 (rev 07)
02:02.0 PCI bridge: Pericom Semiconductor Device c016 (rev 06)
02:04.0 PCI bridge: Pericom Semiconductor Device c016 (rev 06)
02:06.0 PCI bridge: Pericom Semiconductor Device c016 (rev 06)
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-26  4:15       ` Keith Busch
  2021-10-26  4:44         ` Li Chen
@ 2021-10-29 10:52         ` Li Chen
  2021-10-29 19:42           ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Li Chen @ 2021-10-29 10:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

> -----Original Message-----
> From: Keith Busch [mailto:kbusch@kernel.org]
> Sent: Tuesday, October 26, 2021 12:16 PM
> To: Li Chen
> Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> Ltd NVMe SSD Controller 980". From its datasheet,
> https://urldefense.com/v3/__https://s3.ap-northeast-
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> Is there other ways/tools(like nvme-cli) to query?
> 
> The driver will export a sysfs property for it if it is supported:
> 
>   # cat /sys/class/nvme/nvme0/cmb
> 
> If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> 
> > > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > >
> > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > >
> > > > > Question:
> > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > >
> > > That means the nvme controller didn't provide a response to a posted
> > > command within the driver's latency tolerance.
> >
> > FYI, with the help of pci bridger's vendor, they find something interesting:
> "From catc log, I saw some memory read pkts sent from SSD card, but its memory
> range is within the memory range of switch down port. So, switch down port will
> replay UR pkt. It seems not normal." and "Why SSD card send out some memory
> pkts which memory address is within switch down port's memory range. If so,
> switch will response UR pkts". I also don't understand how can this happen?
> 
> I think we can safely assume you're not attempting peer-to-peer, so that
> behavior as described shouldn't be happening. It sounds like the memory
> windows may be incorrect. The dmesg may help to show if something appears
> wrong.

Hi, Keith

Agreed that no peer-to-peer DMA is involved here. After confirming with the switch vendor today, the two URs (unsupported requests) happen because the NVMe drive is trying to DMA-read DRAM at bus addresses 0x80d5000 and 0x80d5100. Both bus addresses fall inside the switch downstream port's memory window, so the downstream port reports a UR.

In our SoC, DMA/bus/PCI addresses and physical/AXI addresses map 1:1, and the DRAM region of the physical address space is 0x00_0000_0000 - 0x0f_ffff_ffff (64 GB), so bus addresses 0x80d5000 and 0x80d5100 translate to the same CPU addresses, both of which lie inside DRAM.

Neither our bootloader nor our ROM code enumerates or configures PCIe devices and switches, so switch configuration is left entirely to the kernel.

Coming back to the subject of this thread, "nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges", I found:

1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
(which makes the NVMe drive time out)

The switch (the bridge above the NVMe drive) gets this resource window:
Memory behind bridge: 08000000-080fffff [size=1M]

0x80d5000 and 0x80d5100 are both inside this range.

2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
(which does not make the NVMe drive time out)

The switch (the bridge above the NVMe drive) gets this resource window:
Memory behind bridge: 00400000-004fffff [size=1M]

0x80d5000 and 0x80d5100 are not inside this range, so if the NVMe drive tries to read 0x80d5000 and 0x80d5100, no UR is generated.
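
For reference, here is how I read those two ranges entries (standard 3-cell PCI address / 2-cell CPU address / 2-cell size encoding); the decoded values match the "MEM 0x2000400000..0x20083fffff -> 0x0000400000" line and the root bus resources in dmesg:

  <0x02000000 0x00 0x08000000  0x20 0x08000000  0x00 0x04000000>
    non-prefetchable MEM, bus 0x08000000, CPU 0x2008000000, size 0x04000000 (64M)

  <0x02000000 0x00 0x00400000  0x20 0x00400000  0x00 0x08000000>
    non-prefetchable MEM, bus 0x00400000, CPU 0x2000400000, size 0x08000000 (128M)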


From /proc/iomem:
# cat /proc/iomem
01200000-ffffffff : System RAM
  01280000-022affff : Kernel code
  022b0000-0295ffff : reserved
  02960000-040cffff : Kernel data
  05280000-0528ffff : reserved
  41cc0000-422c0fff : reserved
  422c1000-4232afff : reserved
  4232d000-667bbfff : reserved
  667bc000-667bcfff : reserved
  667bd000-667c0fff : reserved
  667c1000-ffffffff : reserved
2000000000-2000000fff : cfg

Nothing uses 0x0000000-0x1200000, so "Memory behind bridge: 00400000-004fffff [size=1M]" can never conflict with System RAM (because 0x1200000 > 0x004fffff).


The above answers the question in the subject line; one question remains: what is the right way to resolve this problem? Configuring the switch memory window indirectly through the ranges property (which is what I did)? Or something else?

I don't think changing the ranges property is the right way: if my PCIe topology becomes more complex and gains more endpoints or switches, I may have to reserve more MMIO space through the ranges property (please correct me if I'm wrong), and the end of the switch's memory window may then exceed 0x01200000. To avoid getting URs again, I would have to reserve even more physical address space (e.g., move the kernel start address from 0x01200000 to 0x02000000), which shrinks the visible DRAM (I have verified this with "free -m"); that is not acceptable.
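(As a rough sense of the cost: just moving the kernel start address from 0x01200000 to 0x02000000 gives up 0x00e00000 = 14 MB of DRAM, before counting the extra window space any additional endpoints would need.)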


So, is there any better solution?

Regards,
Li 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-29 10:52         ` Li Chen
@ 2021-10-29 19:42           ` Bjorn Helgaas
  2021-11-02  5:18             ` Li Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2021-10-29 19:42 UTC (permalink / raw)
  To: Li Chen
  Cc: Keith Busch, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > -----Original Message-----
> > From: Keith Busch [mailto:kbusch@kernel.org]
> > Sent: Tuesday, October 26, 2021 12:16 PM
> > To: Li Chen
> > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> > prefetch mmio outbound/ranges
> > 
> > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> > Ltd NVMe SSD Controller 980". From its datasheet,
> > https://urldefense.com/v3/__https://s3.ap-northeast-
> > 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> > Is there other ways/tools(like nvme-cli) to query?
> > 
> > The driver will export a sysfs property for it if it is supported:
> > 
> >   # cat /sys/class/nvme/nvme0/cmb
> > 
> > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > 
> > > > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > >
> > > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > >
> > > > > > Question:
> > > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > > >
> > > > That means the nvme controller didn't provide a response to a posted
> > > > command within the driver's latency tolerance.
> > >
> > > FYI, with the help of pci bridger's vendor, they find something
> > > interesting:
> > "From catc log, I saw some memory read pkts sent from SSD card,
> > but its memory range is within the memory range of switch down
> > port. So, switch down port will replay UR pkt. It seems not
> > normal." and "Why SSD card send out some memory pkts which memory
> > address is within switch down port's memory range. If so, switch
> > will response UR pkts". I also don't understand how can this
> > happen?
> > 
> > I think we can safely assume you're not attempting peer-to-peer,
> > so that behavior as described shouldn't be happening. It sounds
> > like the memory windows may be incorrect. The dmesg may help to
> > show if something appears wrong.
> 
> Agree that here doesn't involve peer-to-peer DMA. After conforming
> from switch vendor today, the two ur(unsupported request) is because
> nvme is trying to dma read dram with bus address 80d5000 and
> 80d5100. But the two bus addresses are located in switch's down port
> range, so the switch down port report ur. 
> 
> In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> and DRAM space in physical memory address space is 000000.0000 -
> 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> are also 80d5000 and 80d5100, which both located inside dram space. 
> 
> Both our bootloader and romcode don't enum and configure pcie
> devices and switches, so the switch cfg stage should be left to
> kernel. 
> 
> Come back to the subject of this thread: " nvme may get timeout from
> dd when using different non-prefetch mmio outbound/ranges". I found:
> 
> 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> (which will timeout nvme)
> 
> Switch(bridge of nvme)'s resource window: 
> Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
>
> 80d5000 and 80d5100 are both inside this range.

The PCI host bridge MMIO window is here:

  pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus address [0x08000000-0x0bffffff])
  pci 0000:01:00.0: PCI bridge to [bus 02-05]
  pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
  pci 0000:02:06.0: PCI bridge to [bus 05]
  pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
  pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]

So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
If the NVMe device is generating DMA transactions to 0x080d5000, which
is inside that range, those will be interpreted as peer-to-peer
transactions.  But obviously that's not intended and there's no device
at 0x080d5000 anyway.

My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
in how that's set up.  Is there an IOMMU?  There should be arch code
that knows what RAM is available for DMA buffers, maybe based on the
DT.  I'm not really familiar with how all that would be arranged, but
the complete dmesg log and complete DT might have a clue.  Can you
post those somewhere?

> 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>; 
> (which make nvme not timeout) 
> 
> Switch(bridge of nvme)'s resource window: 
> Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> 
> 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> read 80d5000 and 80d5100 , ur won't happen.
>
> From /proc/iomen:
> # cat /proc/iomem
> 01200000-ffffffff : System RAM
>   01280000-022affff : Kernel code
>   022b0000-0295ffff : reserved
>   02960000-040cffff : Kernel data
>   05280000-0528ffff : reserved
>   41cc0000-422c0fff : reserved
>   422c1000-4232afff : reserved
>   4232d000-667bbfff : reserved
>   667bc000-667bcfff : reserved
>   667bd000-667c0fff : reserved
>   667c1000-ffffffff : reserved
> 2000000000-2000000fff : cfg
> 
> No one uses 0000000-1200000, so " Memory behind bridge: Memory
> behind bridge: 00400000-004fffff [size=1M]" will never have any
> problem(because 0x1200000 > 0x004fffff). 
> 
> Above answers the question in Subject, one question left: what's the
> right way to resolve this problem? Use ranges property to configure
> switch memory window indirectly(just what I did)? Or something else?
> 
> I don't think changing range property is the right way: If my PCIe
> topology becomes more complex and have more endpoints or switches,
> maybe I have to reserve more MMIO through range property(please
> correct me if I'm wrong), the end of switch's memory window may be
> larger than 0x01200000. In case getting ur again,  I must reserve
> more physical memory address for them(like change kernel start
> address 0x01200000 to 0x02000000), which will make my visible dram
> smaller(I have verified it with "free -m"), it is not acceptable.

Right, I don't think changing the PCI ranges property is the right
answer.  I think it's just a coincidence that moving the host bridge
MMIO aperture happens to move it out of the way of the DMA to
0x080d5000.

As far as I can tell, the PCI core and the nvme driver are doing the
right things here, and the problem is something behind the DMA API.

I think there should be something that removes the MMIO aperture bus
addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
pool of memory available for DMA buffers.

In the non-timeout case the MMIO aperture starts at bus address
0x00400000, and the windows actually assigned along the NVMe path
(0x00400000-0x005fffff) sit entirely below the 0x01200000 start of the
System RAM area, which would explain why a DMA buffer would never
overlap with them.

Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-10-29 19:42           ` Bjorn Helgaas
@ 2021-11-02  5:18             ` Li Chen
  2021-11-02  7:12               ` Li Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Li Chen @ 2021-11-02  5:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi, Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Saturday, October 30, 2021 3:43 AM
> To: Li Chen
> Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > -----Original Message-----
> > > From: Keith Busch [mailto:kbusch@kernel.org]
> > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > To: Li Chen
> > > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> Jens
> > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > > prefetch mmio outbound/ranges
> > >
> > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics
> Co
> > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not
> sure.
> > > Is there other ways/tools(like nvme-cli) to query?
> > >
> > > The driver will export a sysfs property for it if it is supported:
> > >
> > >   # cat /sys/class/nvme/nvme0/cmb
> > >
> > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > >
> > > > > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > >
> > > > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> window]
> > > > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> window]
> > > > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > >
> > > > > > > Question:
> > > > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > > > >
> > > > > That means the nvme controller didn't provide a response to a posted
> > > > > command within the driver's latency tolerance.
> > > >
> > > > FYI, with the help of pci bridger's vendor, they find something
> > > > interesting:
> > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > but its memory range is within the memory range of switch down
> > > port. So, switch down port will replay UR pkt. It seems not
> > > normal." and "Why SSD card send out some memory pkts which memory
> > > address is within switch down port's memory range. If so, switch
> > > will response UR pkts". I also don't understand how can this
> > > happen?
> > >
> > > I think we can safely assume you're not attempting peer-to-peer,
> > > so that behavior as described shouldn't be happening. It sounds
> > > like the memory windows may be incorrect. The dmesg may help to
> > > show if something appears wrong.
> >
> > Agree that here doesn't involve peer-to-peer DMA. After conforming
> > from switch vendor today, the two ur(unsupported request) is because
> > nvme is trying to dma read dram with bus address 80d5000 and
> > 80d5100. But the two bus addresses are located in switch's down port
> > range, so the switch down port report ur.
> >
> > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > and DRAM space in physical memory address space is 000000.0000 -
> > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > are also 80d5000 and 80d5100, which both located inside dram space.
> >
> > Both our bootloader and romcode don't enum and configure pcie
> > devices and switches, so the switch cfg stage should be left to
> > kernel.
> >
> > Come back to the subject of this thread: " nvme may get timeout from
> > dd when using different non-prefetch mmio outbound/ranges". I found:
> >
> > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > (which will timeout nvme)
> >
> > Switch(bridge of nvme)'s resource window:
> > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> >
> > 80d5000 and 80d5100 are both inside this range.
> 
> The PCI host bridge MMIO window is here:
> 
>   pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> address [0x08000000-0x0bffffff])
>   pci 0000:01:00.0: PCI bridge to [bus 02-05]
>   pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
>   pci 0000:02:06.0: PCI bridge to [bus 05]
>   pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
>   pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> 
> So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> If the NVMe device is generating DMA transactions to 0x080d5000, which
> is inside that range, those will be interpreted as peer-to-peer
> transactions.  But obviously that's not intended and there's no device
> at 0x080d5000 anyway.
> 
> My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
> in how that's set up.  Is there an IOMMU?  There should be arch code
> that knows what RAM is available for DMA buffers, maybe based on the
> DT.  I'm not really familiar with how all that would be arranged, but
> the complete dmesg log and complete DT might have a clue.  Can you
> post those somewhere?

After adding some printk debugging, I found that nvme_pci_setup_prps() gets some DMA addresses from the sg list that fall inside the switch's memory window, but I don't know where the sg comes from (see the comments in the following source code):

static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
		struct request *req, struct nvme_rw_command *cmnd)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct dma_pool *pool;
	int length = blk_rq_payload_bytes(req);
	struct scatterlist *sg = iod->sg;
	int dma_len = sg_dma_len(sg);
	u64 dma_addr = sg_dma_address(sg);
                ......
	for (;;) {
		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
			__le64 *old_prp_list = prp_list;
			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
			printk("lchen %s %d dma pool %llx", __func__, __LINE__, prp_dma);
			if (!prp_list)
				goto free_prps;
			list[iod->npages++] = prp_list;
			prp_list[0] = old_prp_list[i - 1];
			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
			i = 1;
		}
		prp_list[i++] = cpu_to_le64(dma_addr);
		dma_len -= NVME_CTRL_PAGE_SIZE;
		dma_addr += NVME_CTRL_PAGE_SIZE;
		length -= NVME_CTRL_PAGE_SIZE;
		if (length <= 0)
			break;
		if (dma_len > 0)
			continue;
		if (unlikely(dma_len < 0))
			goto bad_sgl;
		sg = sg_next(sg);
		dma_addr = sg_dma_address(sg);
		dma_len = sg_dma_len(sg);


		// XXX: Here I get the following output; these addresses are inside the bridge's window 08000000-080fffff [size=1M]
		/*

# dmesg | grep " 80" | grep -v "  80"
[    0.000476] Console: colour dummy device 80x25
[   79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
[   79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
[  111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
[  111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
			 * * */
		printk("lchen dma %s %d addr %llx, end addr %llx", __func__, __LINE__, dma_addr, dma_addr + dma_len);
	}
done:
	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
	return BLK_STS_OK;
free_prps:
	nvme_free_prps(dev, req);
	return BLK_STS_RESOURCE;
bad_sgl:
	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
			"Invalid SGL for payload:%d nents:%d\n",
			blk_rq_payload_bytes(req), iod->nents);
	return BLK_STS_IOERR;
}

Backtrace of this function:
# entries-in-buffer/entries-written: 1574/1574   #P:2
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
    kworker/u4:0-7       [000] ...1    40.095494: nvme_queue_rq <-blk_mq_dispatch_rq_list
    kworker/u4:0-7       [000] ...1    40.095503: <stack trace>
 => nvme_queue_rq
 => blk_mq_dispatch_rq_list
 => __blk_mq_do_dispatch_sched
 => __blk_mq_sched_dispatch_requests
 => blk_mq_sched_dispatch_requests
 => __blk_mq_run_hw_queue
 => __blk_mq_delay_run_hw_queue
 => blk_mq_run_hw_queue
 => blk_mq_sched_insert_requests
 => blk_mq_flush_plug_list
 => blk_flush_plug_list
 => blk_mq_submit_bio
 => __submit_bio_noacct_mq
 => submit_bio_noacct
 => submit_bio
 => submit_bh_wbc.constprop.0
 => __block_write_full_page
 => block_write_full_page
 => blkdev_writepage
 => __writepage
 => write_cache_pages
 => generic_writepages
 => blkdev_writepages
 => do_writepages
 => __writeback_single_inode
 => writeback_sb_inodes
 => __writeback_inodes_wb
 => wb_writeback
 => wb_do_writeback
 => wb_workfn
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork


We don't have an IOMMU, and DMA/bus addresses map 1:1 to CPU addresses.
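
To make the "1:1, no IOMMU" point concrete, here is a minimal sketch of the dma-direct mapping path on this platform (simplified from kernel/dma/direct.c, not the literal code; the headers and function name are only illustrative):

#include <linux/dma-direct.h>	/* phys_to_dma() */
#include <linux/device.h>

/*
 * With no IOMMU and a 1:1 bus:CPU mapping, the bus address handed to
 * the NVMe device is simply the buffer's physical address.  So a DMA
 * buffer that happens to sit at physical 0x080d5000 becomes bus
 * address 0x080d5000, which in the timeout configuration lands inside
 * the 08000000-080fffff window routed to the downstream port, and the
 * switch answers the device's read with a UR instead of forwarding it
 * to DRAM.
 */
static dma_addr_t sketch_map_page(struct device *dev, struct page *page,
				  unsigned long offset, size_t size)
{
	phys_addr_t phys = page_to_phys(page) + offset;

	return phys_to_dma(dev, phys);	/* identity on this SoC */
}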


Here is the whole dmesg output(without my debug log): https://paste.debian.net/1217721/
Here is our dtsi: https://paste.debian.net/1217723/
> 
> > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > (which make nvme not timeout)
> >
> > Switch(bridge of nvme)'s resource window:
> > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> >
> > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > read 80d5000 and 80d5100 , ur won't happen.
> >
> > From /proc/iomen:
> > # cat /proc/iomem
> > 01200000-ffffffff : System RAM
> >   01280000-022affff : Kernel code
> >   022b0000-0295ffff : reserved
> >   02960000-040cffff : Kernel data
> >   05280000-0528ffff : reserved
> >   41cc0000-422c0fff : reserved
> >   422c1000-4232afff : reserved
> >   4232d000-667bbfff : reserved
> >   667bc000-667bcfff : reserved
> >   667bd000-667c0fff : reserved
> >   667c1000-ffffffff : reserved
> > 2000000000-2000000fff : cfg
> >
> > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > problem(because 0x1200000 > 0x004fffff).
> >
> > Above answers the question in Subject, one question left: what's the
> > right way to resolve this problem? Use ranges property to configure
> > switch memory window indirectly(just what I did)? Or something else?
> >
> > I don't think changing range property is the right way: If my PCIe
> > topology becomes more complex and have more endpoints or switches,
> > maybe I have to reserve more MMIO through range property(please
> > correct me if I'm wrong), the end of switch's memory window may be
> > larger than 0x01200000. In case getting ur again,  I must reserve
> > more physical memory address for them(like change kernel start
> > address 0x01200000 to 0x02000000), which will make my visible dram
> > smaller(I have verified it with "free -m"), it is not acceptable.
> 
> Right, I don't think changing the PCI ranges property is the right
> answer.  I think it's just a coincidence that moving the host bridge
> MMIO aperture happens to move it out of the way of the DMA to
> 0x080d5000.
> 
> As far as I can tell, the PCI core and the nvme driver are doing the
> right things here, and the problem is something behind the DMA API.
> 
> I think there should be something that removes the MMIO aperture bus
> addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> pool of memory available for DMA buffers.
> 
> The MMIO aperture bus addresses in the non-timeout case,
> 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> System RAM area, which would explain why a DMA buffer would never
> overlap with it.
> 
> Bjorn

Regards,
Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-11-02  5:18             ` Li Chen
@ 2021-11-02  7:12               ` Li Chen
  2021-11-03 10:04                 ` Li Chen
  2021-11-29  3:04                 ` Li Chen
  0 siblings, 2 replies; 14+ messages in thread
From: Li Chen @ 2021-11-02  7:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme



> -----Original Message-----
> From: Li Chen
> Sent: Tuesday, November 2, 2021 1:18 PM
> To: Bjorn Helgaas
> Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> Hi, Bjorn
> 
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > Sent: Saturday, October 30, 2021 3:43 AM
> > To: Li Chen
> > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > prefetch mmio outbound/ranges
> >
> > On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > > -----Original Message-----
> > > > From: Keith Busch [mailto:kbusch@kernel.org]
> > > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > > To: Li Chen
> > > > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> > Jens
> > > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> > non-
> > > > prefetch mmio outbound/ranges
> > > >
> > > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics
> > Co
> > > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > > >
> >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > > >
> >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm
> not
> > sure.
> > > > Is there other ways/tools(like nvme-cli) to query?
> > > >
> > > > The driver will export a sysfs property for it if it is supported:
> > > >
> > > >   # cat /sys/class/nvme/nvme0/cmb
> > > >
> > > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > > >
> > > > > > > I don't know how to interpret "ranges".  Can you supply the dmesg
> and
> > > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > > >
> > > > > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> > window]
> > > > > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> > window]
> > > > > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > > >
> > > > > > > > Question:
> > > > > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > > > > >
> > > > > > That means the nvme controller didn't provide a response to a posted
> > > > > > command within the driver's latency tolerance.
> > > > >
> > > > > FYI, with the help of pci bridger's vendor, they find something
> > > > > interesting:
> > > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > > but its memory range is within the memory range of switch down
> > > > port. So, switch down port will replay UR pkt. It seems not
> > > > normal." and "Why SSD card send out some memory pkts which memory
> > > > address is within switch down port's memory range. If so, switch
> > > > will response UR pkts". I also don't understand how can this
> > > > happen?
> > > >
> > > > I think we can safely assume you're not attempting peer-to-peer,
> > > > so that behavior as described shouldn't be happening. It sounds
> > > > like the memory windows may be incorrect. The dmesg may help to
> > > > show if something appears wrong.
> > >
> > > Agree that here doesn't involve peer-to-peer DMA. After conforming
> > > from switch vendor today, the two ur(unsupported request) is because
> > > nvme is trying to dma read dram with bus address 80d5000 and
> > > 80d5100. But the two bus addresses are located in switch's down port
> > > range, so the switch down port report ur.
> > >
> > > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > > and DRAM space in physical memory address space is 000000.0000 -
> > > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > > are also 80d5000 and 80d5100, which both located inside dram space.
> > >
> > > Both our bootloader and romcode don't enum and configure pcie
> > > devices and switches, so the switch cfg stage should be left to
> > > kernel.
> > >
> > > Come back to the subject of this thread: " nvme may get timeout from
> > > dd when using different non-prefetch mmio outbound/ranges". I found:
> > >
> > > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > > (which will timeout nvme)
> > >
> > > Switch(bridge of nvme)'s resource window:
> > > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> > >
> > > 80d5000 and 80d5100 are both inside this range.
> >
> > The PCI host bridge MMIO window is here:
> >
> >   pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> > address [0x08000000-0x0bffffff])
> >   pci 0000:01:00.0: PCI bridge to [bus 02-05]
> >   pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
> >   pci 0000:02:06.0: PCI bridge to [bus 05]
> >   pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
> >   pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> >
> > So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> > If the NVMe device is generating DMA transactions to 0x080d5000, which
> > is inside that range, those will be interpreted as peer-to-peer
> > transactions.  But obviously that's not intended and there's no device
> > at 0x080d5000 anyway.
> >
> > My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> > dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
> > in how that's set up.  Is there an IOMMU?  There should be arch code
> > that knows what RAM is available for DMA buffers, maybe based on the
> > DT.  I'm not really familiar with how all that would be arranged, but
> > the complete dmesg log and complete DT might have a clue.  Can you
> > post those somewhere?
> 
> After some printk, I found nvme_pci_setup_prps get some dma addresses inside
> switch's memory window from sg, but I don't where the sg is from(see
> comments in following source codes):

I just noticed that it should come from mempool_alloc() in nvme_map_data():
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
		struct nvme_command *cmnd)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	blk_status_t ret = BLK_STS_RESOURCE;
	int nr_mapped;

	if (blk_rq_nr_phys_segments(req) == 1) {
		struct bio_vec bv = req_bvec(req);

		if (!is_pci_p2pdma_page(bv.bv_page)) {
			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
				return nvme_setup_prp_simple(dev, req,
							     &cmnd->rw, &bv);

			if (iod->nvmeq->qid &&
			    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
				return nvme_setup_sgl_simple(dev, req,
							     &cmnd->rw, &bv);
		}
	}

	iod->dma_len = 0;
	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
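	/*
	 * Note (based on the upstream 5.10 flow): at this point iod->sg has
	 * only been allocated from the mempool; sg_init_table(),
	 * blk_rq_map_sg() and dma_map_sg_attrs() run later in this function,
	 * so the sg_dma_address()/sg_dma_len() values read just below reflect
	 * whatever this pool chunk last held, not this request's mapping.
	 */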


	unsigned int l = sg_dma_address(iod->sg), r = sg_dma_address(iod->sg) + sg_dma_len(iod->sg), tl = 0x08000000, tr = tl + 0x04000000, ntl = 0x00400000, ntr = ntl + 0x08000000;

/*
# dmesg | grep "region-timeout ? 1" | grep nvme_map_data
[   16.002446] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   16.341240] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1a21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   16.405938] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   36.126917] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   36.703839] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   38.510086] lchen nvme_map_data, 895: first_dma 0, end_dma 89c621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.542394] lchen nvme_map_data, 895: first_dma 0, end_dma 89c421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.573405] lchen nvme_map_data, 895: first_dma 0, end_dma 87ee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.604419] lchen nvme_map_data, 895: first_dma 0, end_dma 87ec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.874395] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.905323] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa021c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.968675] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   40.999659] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.030601] lchen nvme_map_data, 895: first_dma 0, end_dma 8bde21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.061629] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.092598] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.123677] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.160960] lchen nvme_map_data, 895: first_dma 0, end_dma 8cee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.193609] lchen nvme_map_data, 895: first_dma 0, end_dma 8cec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.224607] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.255592] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[   41.286594] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
*/
	printk("lchen %s, %d: first_dma %llx, end_dma %llx, inside region-timeout ? %d, is inside region-non-timeout ? %d", __func__, __LINE__, l, r, l >= tl && l <= tr || r >= tl && r <= tr, l >= ntl && l <= ntr || r >= ntl && r <= ntr);



	//printk("lchen %s %d, addr starts %llx, ends %llx", __func__, __LINE__, sg_dma_address(iod->sg), sg_dma_address(iod->sg) + sg_dma_len(iod->sg));

    ................
}

> 
> static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> 		struct request *req, struct nvme_rw_command *cmnd)
> {
> 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> 	struct dma_pool *pool;
> 	int length = blk_rq_payload_bytes(req);
> 	struct scatterlist *sg = iod->sg;
> 	int dma_len = sg_dma_len(sg);
> 	u64 dma_addr = sg_dma_address(sg);
>                 ......
> 	for (;;) {
> 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> 			__le64 *old_prp_list = prp_list;
> 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC,
> &prp_dma);
> 			printk("lchen %s %d dma pool %llx", __func__, __LINE__,
> prp_dma);
> 			if (!prp_list)
> 				goto free_prps;
> 			list[iod->npages++] = prp_list;
> 			prp_list[0] = old_prp_list[i - 1];
> 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> 			i = 1;
> 		}
> 		prp_list[i++] = cpu_to_le64(dma_addr);
> 		dma_len -= NVME_CTRL_PAGE_SIZE;
> 		dma_addr += NVME_CTRL_PAGE_SIZE;
> 		length -= NVME_CTRL_PAGE_SIZE;
> 		if (length <= 0)
> 			break;
> 		if (dma_len > 0)
> 			continue;
> 		if (unlikely(dma_len < 0))
> 			goto bad_sgl;
> 		sg = sg_next(sg);
> 		dma_addr = sg_dma_address(sg);
> 		dma_len = sg_dma_len(sg);
> 
> 
> 		// XXX: Here get the following output, the region is inside bridge's
> window 08000000-080fffff [size=1M]
> 		/*
> 
> # dmesg | grep " 80" | grep -v "  80"
> [    0.000476] Console: colour dummy device 80x25
> [   79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> 80bc000
> [   79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> 8090000
> [  111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> 8090000
> [  111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> 80bc000
> 			 * * */
> 		printk("lchen dma %s %d addr %llx, end addr %llx", __func__,
> __LINE__, dma_addr, dma_addr + dma_len);
> 	}
> done:
> 	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> 	return BLK_STS_OK;
> free_prps:
> 	nvme_free_prps(dev, req);
> 	return BLK_STS_RESOURCE;
> bad_sgl:
> 	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
> 			"Invalid SGL for payload:%d nents:%d\n",
> 			blk_rq_payload_bytes(req), iod->nents);
> 	return BLK_STS_IOERR;
> }
> 
> Backtrace of this function:
> # entries-in-buffer/entries-written: 1574/1574   #P:2
> #
> #                                _-----=> irqs-off
> #                               / _----=> need-resched
> #                              | / _---=> hardirq/softirq
> #                              || / _--=> preempt-depth
> #                              ||| /     delay
> #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> #              | |         |   ||||      |         |
>     kworker/u4:0-7       [000] ...1    40.095494: nvme_queue_rq <-
> blk_mq_dispatch_rq_list
>     kworker/u4:0-7       [000] ...1    40.095503: <stack trace>
>  => nvme_queue_rq
>  => blk_mq_dispatch_rq_list
>  => __blk_mq_do_dispatch_sched
>  => __blk_mq_sched_dispatch_requests
>  => blk_mq_sched_dispatch_requests
>  => __blk_mq_run_hw_queue
>  => __blk_mq_delay_run_hw_queue
>  => blk_mq_run_hw_queue
>  => blk_mq_sched_insert_requests
>  => blk_mq_flush_plug_list
>  => blk_flush_plug_list
>  => blk_mq_submit_bio
>  => __submit_bio_noacct_mq
>  => submit_bio_noacct
>  => submit_bio
>  => submit_bh_wbc.constprop.0
>  => __block_write_full_page
>  => block_write_full_page
>  => blkdev_writepage
>  => __writepage
>  => write_cache_pages
>  => generic_writepages
>  => blkdev_writepages
>  => do_writepages
>  => __writeback_single_inode
>  => writeback_sb_inodes
>  => __writeback_inodes_wb
>  => wb_writeback
>  => wb_do_writeback
>  => wb_workfn
>  => process_one_work
>  => worker_thread
>  => kthread
>  => ret_from_fork
> 
> 
> We don't have IOMMU and just have 1:1 mapping dma outbound.
> 
> 
> Here is the whole dmesg output(without my debug log):
> https://paste.debian.net/1217721/
> Here is our dtsi: https://paste.debian.net/1217723/
> >
> > > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > > (which make nvme not timeout)
> > >
> > > Switch(bridge of nvme)'s resource window:
> > > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> > >
> > > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > > read 80d5000 and 80d5100 , ur won't happen.
> > >
> > > From /proc/iomen:
> > > # cat /proc/iomem
> > > 01200000-ffffffff : System RAM
> > >   01280000-022affff : Kernel code
> > >   022b0000-0295ffff : reserved
> > >   02960000-040cffff : Kernel data
> > >   05280000-0528ffff : reserved
> > >   41cc0000-422c0fff : reserved
> > >   422c1000-4232afff : reserved
> > >   4232d000-667bbfff : reserved
> > >   667bc000-667bcfff : reserved
> > >   667bd000-667c0fff : reserved
> > >   667c1000-ffffffff : reserved
> > > 2000000000-2000000fff : cfg
> > >
> > > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > > problem(because 0x1200000 > 0x004fffff).
> > >
> > > Above answers the question in Subject, one question left: what's the
> > > right way to resolve this problem? Use ranges property to configure
> > > switch memory window indirectly(just what I did)? Or something else?
> > >
> > > I don't think changing range property is the right way: If my PCIe
> > > topology becomes more complex and have more endpoints or switches,
> > > maybe I have to reserve more MMIO through range property(please
> > > correct me if I'm wrong), the end of switch's memory window may be
> > > larger than 0x01200000. In case getting ur again,  I must reserve
> > > more physical memory address for them(like change kernel start
> > > address 0x01200000 to 0x02000000), which will make my visible dram
> > > smaller(I have verified it with "free -m"), it is not acceptable.
> >
> > Right, I don't think changing the PCI ranges property is the right
> > answer.  I think it's just a coincidence that moving the host bridge
> > MMIO aperture happens to move it out of the way of the DMA to
> > 0x080d5000.
> >
> > As far as I can tell, the PCI core and the nvme driver are doing the
> > right things here, and the problem is something behind the DMA API.
> >
> > I think there should be something that removes the MMIO aperture bus
> > addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> > pool of memory available for DMA buffers.
> >
> > The MMIO aperture bus addresses in the non-timeout case,
> > 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> > System RAM area, which would explain why a DMA buffer would never
> > overlap with it.
> >
> > Bjorn
> 
> Regards,
> Li

Regards,
Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-11-02  7:12               ` Li Chen
@ 2021-11-03 10:04                 ` Li Chen
  2021-11-29  3:04                 ` Li Chen
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2021-11-03 10:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme



> -----Original Message-----
> From: Li Chen
> Sent: Tuesday, November 2, 2021 3:12 PM
> To: Bjorn Helgaas
> Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> 
> 
> > -----Original Message-----
> > From: Li Chen
> > Sent: Tuesday, November 2, 2021 1:18 PM
> > To: Bjorn Helgaas
> > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > Subject: RE: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > prefetch mmio outbound/ranges
> >
> > Hi, Bjorn
> >
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > > Sent: Saturday, October 30, 2021 3:43 AM
> > > To: Li Chen
> > > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> Jens
> > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> > non-
> > > prefetch mmio outbound/ranges
> > >
> > > On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Keith Busch [mailto:kbusch@kernel.org]
> > > > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > > > To: Li Chen
> > > > > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob
> Herring;
> > > > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> > > Jens
> > > > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using
> different
> > > non-
> > > > > prefetch mmio outbound/ranges
> > > > >
> > > > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung
> Electronics
> > > Co
> > > > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > > > >
> > >
> >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > > > >
> > >
> >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm
> > not
> > > sure.
> > > > > Is there other ways/tools(like nvme-cli) to query?
> > > > >
> > > > > The driver will export a sysfs property for it if it is supported:
> > > > >
> > > > >   # cat /sys/class/nvme/nvme0/cmb
> > > > >
> > > > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > > > >
> > > > > > > > I don't know how to interpret "ranges".  Can you supply the dmesg
> > and
> > > > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > > > >
> > > > > > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> > > window]
> > > > > > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> > > window]
> > > > > > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > > > >
> > > > > > > > > Question:
> > > > > > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > > > > > >
> > > > > > > That means the nvme controller didn't provide a response to a posted
> > > > > > > command within the driver's latency tolerance.
> > > > > >
> > > > > > FYI, with the help of pci bridger's vendor, they find something
> > > > > > interesting:
> > > > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > > > but its memory range is within the memory range of switch down
> > > > > port. So, switch down port will replay UR pkt. It seems not
> > > > > normal." and "Why SSD card send out some memory pkts which memory
> > > > > address is within switch down port's memory range. If so, switch
> > > > > will response UR pkts". I also don't understand how can this
> > > > > happen?
> > > > >
> > > > > I think we can safely assume you're not attempting peer-to-peer,
> > > > > so that behavior as described shouldn't be happening. It sounds
> > > > > like the memory windows may be incorrect. The dmesg may help to
> > > > > show if something appears wrong.
> > > >
> > > > Agree that here doesn't involve peer-to-peer DMA. After conforming
> > > > from switch vendor today, the two ur(unsupported request) is because
> > > > nvme is trying to dma read dram with bus address 80d5000 and
> > > > 80d5100. But the two bus addresses are located in switch's down port
> > > > range, so the switch down port report ur.
> > > >
> > > > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > > > and DRAM space in physical memory address space is 000000.0000 -
> > > > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > > > are also 80d5000 and 80d5100, which both located inside dram space.
> > > >
> > > > Both our bootloader and romcode don't enum and configure pcie
> > > > devices and switches, so the switch cfg stage should be left to
> > > > kernel.
> > > >
> > > > Come back to the subject of this thread: " nvme may get timeout from
> > > > dd when using different non-prefetch mmio outbound/ranges". I found:
> > > >
> > > > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > > > (which will timeout nvme)
> > > >
> > > > Switch(bridge of nvme)'s resource window:
> > > > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> > > >
> > > > 80d5000 and 80d5100 are both inside this range.
> > >
> > > The PCI host bridge MMIO window is here:
> > >
> > >   pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> > > address [0x08000000-0x0bffffff])
> > >   pci 0000:01:00.0: PCI bridge to [bus 02-05]
> > >   pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
> > >   pci 0000:02:06.0: PCI bridge to [bus 05]
> > >   pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
> > >   pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> > >
> > > So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> > > If the NVMe device is generating DMA transactions to 0x080d5000, which
> > > is inside that range, those will be interpreted as peer-to-peer
> > > transactions.  But obviously that's not intended and there's no device
> > > at 0x080d5000 anyway.
> > >
> > > My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> > > dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something
> wrong
> > > in how that's set up.  Is there an IOMMU?  There should be arch code
> > > that knows what RAM is available for DMA buffers, maybe based on the
> > > DT.  I'm not really familiar with how all that would be arranged, but
> > > the complete dmesg log and complete DT might have a clue.  Can you
> > > post those somewhere?
> >
> > After some printk, I found nvme_pci_setup_prps get some dma addresses
> inside
> > switch's memory window from sg, but I don't where the sg is from(see
> > comments in following source codes):
> 
> I just noticed it should come from mempool_alloc in nvme_map_data:
> static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> 		struct nvme_command *cmnd)
> {
> 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> 	blk_status_t ret = BLK_STS_RESOURCE;
> 	int nr_mapped;
> 
> 	if (blk_rq_nr_phys_segments(req) == 1) {
> 		struct bio_vec bv = req_bvec(req);
> 
> 		if (!is_pci_p2pdma_page(bv.bv_page)) {
> 			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE
> * 2)
> 				return nvme_setup_prp_simple(dev, req,
> 							     &cmnd->rw, &bv);
> 
> 			if (iod->nvmeq->qid &&
> 			    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
> 				return nvme_setup_sgl_simple(dev, req,
> 							     &cmnd->rw, &bv);
> 		}
> 	}
> 
> 	iod->dma_len = 0;
> 	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
> 
> 
> 	unsigned int l = sg_dma_address(iod->sg), r = sg_dma_address(iod->sg)
> + sg_dma_len(iod->sg), tl = 0x08000000, tr = tl + 0x04000000, ntl = 0x00400000, ntr
> = ntl + 0x08000000;
> 
> /*
> # dmesg | grep "region-timeout ? 1" | grep nvme_map_data
> [   16.002446] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1c21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   16.341240] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1a21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   16.405938] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1821c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   36.126917] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2421c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   36.703839] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2221c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   38.510086] lchen nvme_map_data, 895: first_dma 0, end_dma 89c621c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.542394] lchen nvme_map_data, 895: first_dma 0, end_dma 89c421c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.573405] lchen nvme_map_data, 895: first_dma 0, end_dma 87ee21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.604419] lchen nvme_map_data, 895: first_dma 0, end_dma 87ec21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.874395] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa221c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.905323] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa021c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.968675] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0e21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   40.999659] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0821c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.030601] lchen nvme_map_data, 895: first_dma 0, end_dma 8bde21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.061629] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1e21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.092598] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb621c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.123677] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb221c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.160960] lchen nvme_map_data, 895: first_dma 0, end_dma 8cee21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.193609] lchen nvme_map_data, 895: first_dma 0, end_dma 8cec21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.224607] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7e21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.255592] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7c21c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> [   41.286594] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7821c, inside
> region-timeout ? 1, is inside region-non-timeout ? 0
> */
> 	printk("lchen %s, %d: first_dma %llx, end_dma %llx, inside region-
> timeout ? %d, is inside region-non-timeout ? %d", __func__, __LINE__, l, r, l >= tl
> && l <= tr || r >= tl && r <= tr, l >= ntl && l <= ntr || r >= ntl && r <= ntr);
> 
> 
> 
> 	//printk("lchen %s %d, addr starts %llx, ends %llx", __func__, __LINE__,
> sg_dma_address(iod->sg), sg_dma_address(iod->sg) + sg_dma_len(iod->sg));
> 
>     ................
> }
> 
> >
> > static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> > 		struct request *req, struct nvme_rw_command *cmnd)
> > {
> > 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > 	struct dma_pool *pool;
> > 	int length = blk_rq_payload_bytes(req);
> > 	struct scatterlist *sg = iod->sg;
> > 	int dma_len = sg_dma_len(sg);
> > 	u64 dma_addr = sg_dma_address(sg);
> >                 ......
> > 	for (;;) {
> > 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> > 			__le64 *old_prp_list = prp_list;
> > 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC,
> > &prp_dma);
> > 			printk("lchen %s %d dma pool %llx", __func__, __LINE__,
> > prp_dma);
> > 			if (!prp_list)
> > 				goto free_prps;
> > 			list[iod->npages++] = prp_list;
> > 			prp_list[0] = old_prp_list[i - 1];
> > 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> > 			i = 1;
> > 		}
> > 		prp_list[i++] = cpu_to_le64(dma_addr);
> > 		dma_len -= NVME_CTRL_PAGE_SIZE;
> > 		dma_addr += NVME_CTRL_PAGE_SIZE;
> > 		length -= NVME_CTRL_PAGE_SIZE;
> > 		if (length <= 0)
> > 			break;
> > 		if (dma_len > 0)
> > 			continue;
> > 		if (unlikely(dma_len < 0))
> > 			goto bad_sgl;
> > 		sg = sg_next(sg);
> > 		dma_addr = sg_dma_address(sg);
> > 		dma_len = sg_dma_len(sg);
> >
> >
> > 		// XXX: Here get the following output, the region is inside bridge's
> > window 08000000-080fffff [size=1M]
> > 		/*
> >
> > # dmesg | grep " 80" | grep -v "  80"
> > [    0.000476] Console: colour dummy device 80x25
> > [   79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> > 80bc000
> > [   79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> > 8090000
> > [  111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> > 8090000
> > [  111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> > 80bc000
> > 			 * * */
> > 		printk("lchen dma %s %d addr %llx, end addr %llx", __func__,
> > __LINE__, dma_addr, dma_addr + dma_len);
> > 	}
> > done:
> > 	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> > 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> > 	return BLK_STS_OK;
> > free_prps:
> > 	nvme_free_prps(dev, req);
> > 	return BLK_STS_RESOURCE;
> > bad_sgl:
> > 	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
> > 			"Invalid SGL for payload:%d nents:%d\n",
> > 			blk_rq_payload_bytes(req), iod->nents);
> > 	return BLK_STS_IOERR;
> > }
> >
> > Backtrace of this function:
> > # entries-in-buffer/entries-written: 1574/1574   #P:2
> > #
> > #                                _-----=> irqs-off
> > #                               / _----=> need-resched
> > #                              | / _---=> hardirq/softirq
> > #                              || / _--=> preempt-depth
> > #                              ||| /     delay
> > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > #              | |         |   ||||      |         |
> >     kworker/u4:0-7       [000] ...1    40.095494: nvme_queue_rq <-
> > blk_mq_dispatch_rq_list
> >     kworker/u4:0-7       [000] ...1    40.095503: <stack trace>
> >  => nvme_queue_rq
> >  => blk_mq_dispatch_rq_list
> >  => __blk_mq_do_dispatch_sched
> >  => __blk_mq_sched_dispatch_requests
> >  => blk_mq_sched_dispatch_requests
> >  => __blk_mq_run_hw_queue
> >  => __blk_mq_delay_run_hw_queue
> >  => blk_mq_run_hw_queue
> >  => blk_mq_sched_insert_requests
> >  => blk_mq_flush_plug_list
> >  => blk_flush_plug_list
> >  => blk_mq_submit_bio
> >  => __submit_bio_noacct_mq
> >  => submit_bio_noacct
> >  => submit_bio
> >  => submit_bh_wbc.constprop.0
> >  => __block_write_full_page
> >  => block_write_full_page
> >  => blkdev_writepage
> >  => __writepage
> >  => write_cache_pages
> >  => generic_writepages
> >  => blkdev_writepages
> >  => do_writepages
> >  => __writeback_single_inode
> >  => writeback_sb_inodes
> >  => __writeback_inodes_wb
> >  => wb_writeback
> >  => wb_do_writeback
> >  => wb_workfn
> >  => process_one_work
> >  => worker_thread
> >  => kthread
> >  => ret_from_fork
> >
> >
> > We don't have IOMMU and just have 1:1 mapping dma outbound.
> >
> >
> > Here is the whole dmesg output(without my debug log):
> > https://paste.debian.net/1217721/


From dmesg:


[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000001200000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000001200000-0x00000000ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000001200000-0x00000000ffffffff]
[    0.000000] On node 0 totalpages: 1043968
[    0.000000]   DMA zone: 16312 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 1043968 pages, LIFO batch:63

I only have one zone, ZONE_DMA, and it covers all of System RAM:

# cat /proc/iomem
01200000-ffffffff : System RAM
  01280000-01c3ffff : Kernel code
  01c40000-01ebffff : reserved
  01ec0000-0201ffff : Kernel data
  05280000-0528ffff : reserved
  62000000-65ffffff : reserved
  66186000-66786fff : reserved
  66787000-667c0fff : reserved
  667c3000-667c3fff : reserved
  667c4000-667c5fff : reserved
  667c6000-667c8fff : reserved
  667c9000-ffffffff : reserved
2000000000-2000000fff : cfg
2008000000-200bffffff : pcie-controller@2040000000
  2008000000-20081fffff : PCI Bus 0000:01
    2008000000-20080fffff : PCI Bus 0000:02
      2008000000-20080fffff : PCI Bus 0000:05
        2008000000-2008003fff : 0000:05:00.0
          2008000000-2008003fff : nvme
    2008100000-200817ffff : 0000:01:00.0



From the nvme driver code:

dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
						mempool_kfree,
						(void *) alloc_size,
						GFP_KERNEL, node);

So the mempool here just uses kmalloc() with GFP_KERNEL to get physical memory.

Does that mean I should re-arrange ZONE_DMA/ZONE_DMA32/ZONE_NORMAL (e.g. limit ZONE_DMA to a memory region that does not overlap the PCIe MMIO bus addresses) to fix this problem?



> > Here is our dtsi: https://paste.debian.net/1217723/
> > >
> > > > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > > > (which make nvme not timeout)
> > > >
> > > > Switch(bridge of nvme)'s resource window:
> > > > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> > > >
> > > > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > > > read 80d5000 and 80d5100 , ur won't happen.
> > > >
> > > > From /proc/iomen:
> > > > # cat /proc/iomem
> > > > 01200000-ffffffff : System RAM
> > > >   01280000-022affff : Kernel code
> > > >   022b0000-0295ffff : reserved
> > > >   02960000-040cffff : Kernel data
> > > >   05280000-0528ffff : reserved
> > > >   41cc0000-422c0fff : reserved
> > > >   422c1000-4232afff : reserved
> > > >   4232d000-667bbfff : reserved
> > > >   667bc000-667bcfff : reserved
> > > >   667bd000-667c0fff : reserved
> > > >   667c1000-ffffffff : reserved
> > > > 2000000000-2000000fff : cfg
> > > >
> > > > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > > > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > > > problem(because 0x1200000 > 0x004fffff).
> > > >
> > > > Above answers the question in Subject, one question left: what's the
> > > > right way to resolve this problem? Use ranges property to configure
> > > > switch memory window indirectly(just what I did)? Or something else?
> > > >
> > > > I don't think changing range property is the right way: If my PCIe
> > > > topology becomes more complex and have more endpoints or switches,
> > > > maybe I have to reserve more MMIO through range property(please
> > > > correct me if I'm wrong), the end of switch's memory window may be
> > > > larger than 0x01200000. In case getting ur again,  I must reserve
> > > > more physical memory address for them(like change kernel start
> > > > address 0x01200000 to 0x02000000), which will make my visible dram
> > > > smaller(I have verified it with "free -m"), it is not acceptable.
> > >
> > > Right, I don't think changing the PCI ranges property is the right
> > > answer.  I think it's just a coincidence that moving the host bridge
> > > MMIO aperture happens to move it out of the way of the DMA to
> > > 0x080d5000.
> > >
> > > As far as I can tell, the PCI core and the nvme driver are doing the
> > > right things here, and the problem is something behind the DMA API.
> > >
> > > I think there should be something that removes the MMIO aperture bus
> > > addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> > > pool of memory available for DMA buffers.
> > >
> > > The MMIO aperture bus addresses in the non-timeout case,
> > > 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> > > System RAM area, which would explain why a DMA buffer would never
> > > overlap with it.
> > >
> > > Bjorn
> >
> > Regards,
> > Li
> 
> Regards,
> Li

Regards,
Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges
  2021-11-02  7:12               ` Li Chen
  2021-11-03 10:04                 ` Li Chen
@ 2021-11-29  3:04                 ` Li Chen
  1 sibling, 0 replies; 14+ messages in thread
From: Li Chen @ 2021-11-29  3:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, linux-pci, Lorenzo Pieralisi, Rob Herring, kw,
	Bjorn Helgaas, linux-kernel, Tom Joseph, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Here is a summary of my solution:

From the AMBA bus point of view, DRAM is mapped at 0-64G on our SoC.

The PCIe non-prefetchable MMIO window has to live inside 0-4G, and part of it
ends up being used as the switch bridge's MMIO window. If an EP does DMA and
the physical address is in 0-4G, the DMA (bus) address will be the same,
because the Cadence driver currently only supports a 1:1 mapping between DMA
and physical addresses.

For example, if the EP's upstream switch bridge uses physical addresses 4-5M
as its MMIO window and the EP also uses 4-5M as a DMA address, then when the
TLP reaches the switch bridge, the bridge sees a DMA address inside its own
MMIO memory window and responds with a UR (Unsupported Request) error. That
is exactly what happened with the two outbound ranges tried earlier in this
thread; see the sketch right after this paragraph.
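
For reference, here is roughly how the two outbound "ranges" variants relate
to the switch bridge window (the property values and windows are copied from
the earlier messages in this thread; the comments are mine):

	/* Variant that times out: bus 0x08000000 -> CPU 0x2008000000, 64M.
	 * The switch bridge window ends up at 08000000-080fffff, and DMA to
	 * bus addresses such as 0x080d5000 falls inside it, so the bridge
	 * answers with UR.
	 */
	ranges = <0x02000000 0x00 0x08000000  0x20 0x08000000  0x00 0x04000000>;

	/* Variant that does not time out: bus 0x00400000 -> CPU 0x2000400000,
	 * 128M. The bridge window becomes 00400000-004fffff, which lies below
	 * the start of System RAM (0x01200000), so no DMA buffer can ever
	 * collide with it.
	 */
	ranges = <0x02000000 0x00 0x00400000  0x20 0x00400000  0x00 0x08000000>;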

To resolve this problem, the only way is to make sure the EP can never use
these MMIO windows as DMA addresses; IOW, we have to prevent the slab
allocator from ever handing out the physical addresses behind these MMIO
windows. So, if there is a reserved memory area below 4G, we can place the
PCIe outbound/MMIO window's bus address on top of it, as sketched below.
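
A minimal device-tree sketch of that idea (the node name, label and the
0x78000000 address below are placeholders, not our real dtsi, and it assumes
the controller's outbound translation can place the window at that bus
address):

	reserved-memory {
		#address-cells = <2>;
		#size-cells = <2>;
		ranges;

		/* 64M hole below 4G; no-map keeps it out of the page
		 * allocator, so no DMA buffer can ever land here. */
		pcie_mmio_hole: pcie-mmio@78000000 {
			reg = <0x0 0x78000000 0x0 0x04000000>;
			no-map;
		};
	};

	/* In the pcie-controller node, put the non-prefetch outbound
	 * window's bus address on top of that hole instead of on top of
	 * usable DRAM: bus 0x78000000 -> CPU 0x2008000000, 64M.
	 */
	ranges = <0x02000000 0x00 0x78000000  0x20 0x08000000  0x00 0x04000000>;

If an already-reserved region below 4G (like the reserved areas shown in
/proc/iomem earlier in the thread) is reused instead of a new carve-out, no
extra DRAM is lost.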

Regards
Li

> -----Original Message-----
> From: Li Chen
> Sent: Wednesday, November 3, 2021 6:04 PM
> To: 'Bjorn Helgaas'
> Cc: 'Keith Busch'; 'linux-pci@vger.kernel.org'; 'Lorenzo Pieralisi'; 'Rob Herring';
> 'kw@linux.com'; 'Bjorn Helgaas'; 'linux-kernel@vger.kernel.org'; 'Tom Joseph';
> 'Jens Axboe'; 'Christoph Hellwig'; 'Sagi Grimberg'; 'linux-
> nvme@lists.infradead.org'
> Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> 
> 
> > -----Original Message-----
> > From: Li Chen
> > Sent: Tuesday, November 2, 2021 3:12 PM
> > To: Bjorn Helgaas
> > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > Subject: RE: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > prefetch mmio outbound/ranges
> >
> >
> >
> > > -----Original Message-----
> > > From: Li Chen
> > > Sent: Tuesday, November 2, 2021 1:18 PM
> > > To: Bjorn Helgaas
> > > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> Jens
> > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > Subject: RE: [EXT] Re: nvme may get timeout from dd when using different
> > non-
> > > prefetch mmio outbound/ranges
> > >
> > > Hi, Bjorn
> > >
> > > > -----Original Message-----
> > > > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > > > Sent: Saturday, October 30, 2021 3:43 AM
> > > > To: Li Chen
> > > > Cc: Keith Busch; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob Herring;
> > > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom Joseph;
> > Jens
> > > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@lists.infradead.org
> > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> > > non-
> > > > prefetch mmio outbound/ranges
> > > >
> > > > On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Keith Busch [mailto:kbusch@kernel.org]
> > > > > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > > > > To: Li Chen
> > > > > > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; Lorenzo Pieralisi; Rob
> > Herring;
> > > > > > kw@linux.com; Bjorn Helgaas; linux-kernel@vger.kernel.org; Tom
> Joseph;
> > > > Jens
> > > > > > Axboe; Christoph Hellwig; Sagi Grimberg; linux-
> nvme@lists.infradead.org
> > > > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using
> > different
> > > > non-
> > > > > > prefetch mmio outbound/ranges
> > > > > >
> > > > > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung
> > Electronics
> > > > Co
> > > > > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > > > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > > > > >
> > > >
> > >
> >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > > > > >
> > > >
> > >
> >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > > > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm
> > > not
> > > > sure.
> > > > > > Is there other ways/tools(like nvme-cli) to query?
> > > > > >
> > > > > > The driver will export a sysfs property for it if it is supported:
> > > > > >
> > > > > >   # cat /sys/class/nvme/nvme0/cmb
> > > > > >
> > > > > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > > > > >
> > > > > > > > > I don't know how to interpret "ranges".  Can you supply the dmesg
> > > and
> > > > > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > > > > >
> > > > > > > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> > > > window]
> > > > > > > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> > > > window]
> > > > > > > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > > > > >
> > > > > > > > > > Question:
> > > > > > > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > > > > > > >
> > > > > > > > That means the nvme controller didn't provide a response to a
> posted
> > > > > > > > command within the driver's latency tolerance.
> > > > > > >
> > > > > > > FYI, with the help of pci bridger's vendor, they find something
> > > > > > > interesting:
> > > > > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > > > > but its memory range is within the memory range of switch down
> > > > > > port. So, switch down port will replay UR pkt. It seems not
> > > > > > normal." and "Why SSD card send out some memory pkts which
> memory
> > > > > > address is within switch down port's memory range. If so, switch
> > > > > > will response UR pkts". I also don't understand how can this
> > > > > > happen?
> > > > > >
> > > > > > I think we can safely assume you're not attempting peer-to-peer,
> > > > > > so that behavior as described shouldn't be happening. It sounds
> > > > > > like the memory windows may be incorrect. The dmesg may help to
> > > > > > show if something appears wrong.
> > > > >
> > > > > Agree that here doesn't involve peer-to-peer DMA. After conforming
> > > > > from switch vendor today, the two ur(unsupported request) is because
> > > > > nvme is trying to dma read dram with bus address 80d5000 and
> > > > > 80d5100. But the two bus addresses are located in switch's down port
> > > > > range, so the switch down port report ur.
> > > > >
> > > > > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > > > > and DRAM space in physical memory address space is 000000.0000 -
> > > > > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > > > > are also 80d5000 and 80d5100, which both located inside dram space.
> > > > >
> > > > > Both our bootloader and romcode don't enum and configure pcie
> > > > > devices and switches, so the switch cfg stage should be left to
> > > > > kernel.
> > > > >
> > > > > Come back to the subject of this thread: " nvme may get timeout from
> > > > > dd when using different non-prefetch mmio outbound/ranges". I found:
> > > > >
> > > > > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > > > > (which will timeout nvme)
> > > > >
> > > > > Switch(bridge of nvme)'s resource window:
> > > > > Memory behind bridge: Memory behind bridge: 08000000-080fffff
> [size=1M]
> > > > >
> > > > > 80d5000 and 80d5100 are both inside this range.
> > > >
> > > > The PCI host bridge MMIO window is here:
> > > >
> > > >   pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> > > > address [0x08000000-0x0bffffff])
> > > >   pci 0000:01:00.0: PCI bridge to [bus 02-05]
> > > >   pci 0000:01:00.0:   bridge window [mem 0x2008000000-0x20080fffff]
> > > >   pci 0000:02:06.0: PCI bridge to [bus 05]
> > > >   pci 0000:02:06.0:   bridge window [mem 0x2008000000-0x20080fffff]
> > > >   pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> > > >
> > > > So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> > > > If the NVMe device is generating DMA transactions to 0x080d5000, which
> > > > is inside that range, those will be interpreted as peer-to-peer
> > > > transactions.  But obviously that's not intended and there's no device
> > > > at 0x080d5000 anyway.
> > > >
> > > > My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> > > > dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something
> > wrong
> > > > in how that's set up.  Is there an IOMMU?  There should be arch code
> > > > that knows what RAM is available for DMA buffers, maybe based on the
> > > > DT.  I'm not really familiar with how all that would be arranged, but
> > > > the complete dmesg log and complete DT might have a clue.  Can you
> > > > post those somewhere?
> > >
> > > After some printk, I found nvme_pci_setup_prps get some dma addresses
> > inside
> > > switch's memory window from sg, but I don't where the sg is from(see
> > > comments in following source codes):
> >
> > I just noticed it should come from mempool_alloc in nvme_map_data:
> > static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request
> *req,
> > 		struct nvme_command *cmnd)
> > {
> > 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > 	blk_status_t ret = BLK_STS_RESOURCE;
> > 	int nr_mapped;
> >
> > 	if (blk_rq_nr_phys_segments(req) == 1) {
> > 		struct bio_vec bv = req_bvec(req);
> >
> > 		if (!is_pci_p2pdma_page(bv.bv_page)) {
> > 			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE
> > * 2)
> > 				return nvme_setup_prp_simple(dev, req,
> > 							     &cmnd->rw, &bv);
> >
> > 			if (iod->nvmeq->qid &&
> > 			    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
> > 				return nvme_setup_sgl_simple(dev, req,
> > 							     &cmnd->rw, &bv);
> > 		}
> > 	}
> >
> > 	iod->dma_len = 0;
> > 	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
> >
> >
> > 	unsigned int l = sg_dma_address(iod->sg), r = sg_dma_address(iod->sg)
> > + sg_dma_len(iod->sg), tl = 0x08000000, tr = tl + 0x04000000, ntl = 0x00400000,
> ntr
> > = ntl + 0x08000000;
> >
> > /*
> > # dmesg | grep "region-timeout ? 1" | grep nvme_map_data
> > [   16.002446] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1c21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   16.341240] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1a21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   16.405938] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1821c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   36.126917] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2421c, inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   36.703839] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2221c, inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   38.510086] lchen nvme_map_data, 895: first_dma 0, end_dma 89c621c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.542394] lchen nvme_map_data, 895: first_dma 0, end_dma 89c421c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.573405] lchen nvme_map_data, 895: first_dma 0, end_dma 87ee21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.604419] lchen nvme_map_data, 895: first_dma 0, end_dma 87ec21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.874395] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa221c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.905323] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa021c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.968675] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0e21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   40.999659] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0821c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.030601] lchen nvme_map_data, 895: first_dma 0, end_dma 8bde21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.061629] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1e21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.092598] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb621c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.123677] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb221c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.160960] lchen nvme_map_data, 895: first_dma 0, end_dma 8cee21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.193609] lchen nvme_map_data, 895: first_dma 0, end_dma 8cec21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.224607] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7e21c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.255592] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7c21c, inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > [   41.286594] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7821c,
> inside
> > region-timeout ? 1, is inside region-non-timeout ? 0
> > */
> > 	printk("lchen %s, %d: first_dma %llx, end_dma %llx, inside region-
> > timeout ? %d, is inside region-non-timeout ? %d", __func__, __LINE__, l, r, l >=
> tl
> > && l <= tr || r >= tl && r <= tr, l >= ntl && l <= ntr || r >= ntl && r <= ntr);
> >
> >
> >
> > 	//printk("lchen %s %d, addr starts %llx, ends %llx", __func__, __LINE__,
> > sg_dma_address(iod->sg), sg_dma_address(iod->sg) + sg_dma_len(iod->sg));
> >
> >     ................
> > }
> >
> > >
> > > static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> > > 		struct request *req, struct nvme_rw_command *cmnd)
> > > {
> > > 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > > 	struct dma_pool *pool;
> > > 	int length = blk_rq_payload_bytes(req);
> > > 	struct scatterlist *sg = iod->sg;
> > > 	int dma_len = sg_dma_len(sg);
> > > 	u64 dma_addr = sg_dma_address(sg);
> > >                 ......
> > > 	for (;;) {
> > > 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> > > 			__le64 *old_prp_list = prp_list;
> > > 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC,
> > > &prp_dma);
> > > 			printk("lchen %s %d dma pool %llx", __func__, __LINE__,
> > > prp_dma);
> > > 			if (!prp_list)
> > > 				goto free_prps;
> > > 			list[iod->npages++] = prp_list;
> > > 			prp_list[0] = old_prp_list[i - 1];
> > > 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> > > 			i = 1;
> > > 		}
> > > 		prp_list[i++] = cpu_to_le64(dma_addr);
> > > 		dma_len -= NVME_CTRL_PAGE_SIZE;
> > > 		dma_addr += NVME_CTRL_PAGE_SIZE;
> > > 		length -= NVME_CTRL_PAGE_SIZE;
> > > 		if (length <= 0)
> > > 			break;
> > > 		if (dma_len > 0)
> > > 			continue;
> > > 		if (unlikely(dma_len < 0))
> > > 			goto bad_sgl;
> > > 		sg = sg_next(sg);
> > > 		dma_addr = sg_dma_address(sg);
> > > 		dma_len = sg_dma_len(sg);
> > >
> > >
> > > 		// XXX: Here get the following output, the region is inside bridge's
> > > window 08000000-080fffff [size=1M]
> > > 		/*
> > >
> > > # dmesg | grep " 80" | grep -v "  80"
> > > [    0.000476] Console: colour dummy device 80x25
> > > [   79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> > > 80bc000
> > > [   79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> > > 8090000
> > > [  111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr
> > > 8090000
> > > [  111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr
> > > 80bc000
> > > 			 * * */
> > > 		printk("lchen dma %s %d addr %llx, end addr %llx", __func__,
> > > __LINE__, dma_addr, dma_addr + dma_len);
> > > 	}
> > > done:
> > > 	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> > > 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> > > 	return BLK_STS_OK;
> > > free_prps:
> > > 	nvme_free_prps(dev, req);
> > > 	return BLK_STS_RESOURCE;
> > > bad_sgl:
> > > 	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
> > > 			"Invalid SGL for payload:%d nents:%d\n",
> > > 			blk_rq_payload_bytes(req), iod->nents);
> > > 	return BLK_STS_IOERR;
> > > }
> > >
> > > Backtrace of this function:
> > > # entries-in-buffer/entries-written: 1574/1574   #P:2
> > > #
> > > #                                _-----=> irqs-off
> > > #                               / _----=> need-resched
> > > #                              | / _---=> hardirq/softirq
> > > #                              || / _--=> preempt-depth
> > > #                              ||| /     delay
> > > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > > #              | |         |   ||||      |         |
> > >     kworker/u4:0-7       [000] ...1    40.095494: nvme_queue_rq <-
> > > blk_mq_dispatch_rq_list
> > >     kworker/u4:0-7       [000] ...1    40.095503: <stack trace>
> > >  => nvme_queue_rq
> > >  => blk_mq_dispatch_rq_list
> > >  => __blk_mq_do_dispatch_sched
> > >  => __blk_mq_sched_dispatch_requests
> > >  => blk_mq_sched_dispatch_requests
> > >  => __blk_mq_run_hw_queue
> > >  => __blk_mq_delay_run_hw_queue
> > >  => blk_mq_run_hw_queue
> > >  => blk_mq_sched_insert_requests
> > >  => blk_mq_flush_plug_list
> > >  => blk_flush_plug_list
> > >  => blk_mq_submit_bio
> > >  => __submit_bio_noacct_mq
> > >  => submit_bio_noacct
> > >  => submit_bio
> > >  => submit_bh_wbc.constprop.0
> > >  => __block_write_full_page
> > >  => block_write_full_page
> > >  => blkdev_writepage
> > >  => __writepage
> > >  => write_cache_pages
> > >  => generic_writepages
> > >  => blkdev_writepages
> > >  => do_writepages
> > >  => __writeback_single_inode
> > >  => writeback_sb_inodes
> > >  => __writeback_inodes_wb
> > >  => wb_writeback
> > >  => wb_do_writeback
> > >  => wb_workfn
> > >  => process_one_work
> > >  => worker_thread
> > >  => kthread
> > >  => ret_from_fork
> > >
> > >
> > > We don't have IOMMU and just have 1:1 mapping dma outbound.
> > >
> > >
> > > Here is the whole dmesg output(without my debug log):
> > > https://paste.debian.net/1217721/
> 
> 
> From dmesg:
> 
> 
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000001200000-0x00000000ffffffff]
> [    0.000000]   DMA32    empty
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000001200000-0x00000000ffffffff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000001200000-
> 0x00000000ffffffff]
> [    0.000000] On node 0 totalpages: 1043968
> [    0.000000]   DMA zone: 16312 pages used for memmap
> [    0.000000]   DMA zone: 0 pages reserved
> [    0.000000]   DMA zone: 1043968 pages, LIFO batch:63
> 
> I only has one zone: ZONE_DMA, and it includes all physical memory:
> 
> # cat /proc/iomem
> 01200000-ffffffff : System RAM
>   01280000-01c3ffff : Kernel code
>   01c40000-01ebffff : reserved
>   01ec0000-0201ffff : Kernel data
>   05280000-0528ffff : reserved
>   62000000-65ffffff : reserved
>   66186000-66786fff : reserved
>   66787000-667c0fff : reserved
>   667c3000-667c3fff : reserved
>   667c4000-667c5fff : reserved
>   667c6000-667c8fff : reserved
>   667c9000-ffffffff : reserved
> 2000000000-2000000fff : cfg
> 2008000000-200bffffff : pcie-controller@2040000000
>   2008000000-20081fffff : PCI Bus 0000:01
>     2008000000-20080fffff : PCI Bus 0000:02
>       2008000000-20080fffff : PCI Bus 0000:05
>         2008000000-2008003fff : 0000:05:00.0
>           2008000000-2008003fff : nvme
>     2008100000-200817ffff : 0000:01:00.0
> 
> 
> 
> From nvme codes:
> 
> dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
> 						mempool_kfree,
> 						(void *) alloc_size,
> 						GFP_KERNEL, node);
> 
> So mempool here just uses kmalloc and GFP_KERNEL to get physical memory.
> 
> Does it mean that I should re-arrange zone_dma/zone_dma/zone_normal(like
> limit zone_dma to mem region that doesn't include pcie mmio) to fix this problem?
> 
> 
> 
> > > Here is our dtsi: https://paste.debian.net/1217723/
> > > >
> > > > > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > > > > (which make nvme not timeout)
> > > > >
> > > > > Switch(bridge of nvme)'s resource window:
> > > > > Memory behind bridge: Memory behind bridge: 00400000-004fffff
> [size=1M]
> > > > >
> > > > > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > > > > read 80d5000 and 80d5100 , ur won't happen.
> > > > >
> > > > > From /proc/iomen:
> > > > > # cat /proc/iomem
> > > > > 01200000-ffffffff : System RAM
> > > > >   01280000-022affff : Kernel code
> > > > >   022b0000-0295ffff : reserved
> > > > >   02960000-040cffff : Kernel data
> > > > >   05280000-0528ffff : reserved
> > > > >   41cc0000-422c0fff : reserved
> > > > >   422c1000-4232afff : reserved
> > > > >   4232d000-667bbfff : reserved
> > > > >   667bc000-667bcfff : reserved
> > > > >   667bd000-667c0fff : reserved
> > > > >   667c1000-ffffffff : reserved
> > > > > 2000000000-2000000fff : cfg
> > > > >
> > > > > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > > > > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > > > > problem(because 0x1200000 > 0x004fffff).
> > > > >
> > > > > Above answers the question in Subject, one question left: what's the
> > > > > right way to resolve this problem? Use ranges property to configure
> > > > > switch memory window indirectly(just what I did)? Or something else?
> > > > >
> > > > > I don't think changing range property is the right way: If my PCIe
> > > > > topology becomes more complex and have more endpoints or switches,
> > > > > maybe I have to reserve more MMIO through range property(please
> > > > > correct me if I'm wrong), the end of switch's memory window may be
> > > > > larger than 0x01200000. In case getting ur again,  I must reserve
> > > > > more physical memory address for them(like change kernel start
> > > > > address 0x01200000 to 0x02000000), which will make my visible dram
> > > > > smaller(I have verified it with "free -m"), it is not acceptable.
> > > >
> > > > Right, I don't think changing the PCI ranges property is the right
> > > > answer.  I think it's just a coincidence that moving the host bridge
> > > > MMIO aperture happens to move it out of the way of the DMA to
> > > > 0x080d5000.
> > > >
> > > > As far as I can tell, the PCI core and the nvme driver are doing the
> > > > right things here, and the problem is something behind the DMA API.
> > > >
> > > > I think there should be something that removes the MMIO aperture bus
> > > > addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> > > > pool of memory available for DMA buffers.
> > > >
> > > > The MMIO aperture bus addresses in the non-timeout case,
> > > > 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> > > > System RAM area, which would explain why a DMA buffer would never
> > > > overlap with it.
> > > >
> > > > Bjorn
> > >
> > > Regards,
> > > Li
> >
> > Regards,
> > Li
> 
> Regards,
> Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-11-29  3:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-22 10:08 nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges Li Chen
2021-10-25  9:02 ` Li Chen
2021-10-25 15:47 ` Bjorn Helgaas
2021-10-25 16:21   ` Keith Busch
2021-10-26  3:40     ` [EXT] " Li Chen
2021-10-26  4:15       ` Keith Busch
2021-10-26  4:44         ` Li Chen
2021-10-29 10:52         ` Li Chen
2021-10-29 19:42           ` Bjorn Helgaas
2021-11-02  5:18             ` Li Chen
2021-11-02  7:12               ` Li Chen
2021-11-03 10:04                 ` Li Chen
2021-11-29  3:04                 ` Li Chen
2021-10-26  4:33   ` Li Chen
