linux-pci.vger.kernel.org archive mirror
* IMX8MM PCIe performance evaluated with NVMe
@ 2021-12-03 21:52 Tim Harvey
  2021-12-03 22:18 ` Krzysztof Wilczyński
  2021-12-03 23:31 ` Keith Busch
  0 siblings, 2 replies; 5+ messages in thread
From: Tim Harvey @ 2021-12-03 21:52 UTC (permalink / raw)
  To: Jingoo Han, Gustavo Pimentel, Rob Herring, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Bjorn Helgaas, linux-pci, Richard Zhu

Greetings,

I'm using PCIe on the IMX8M Mini and testing PCIe performance with an
NVMe drive constrained to 1 lane. The drive in question is a Samsung
SSD980 500GB which claims 3500MB/s read speed (with a gen3 x4 link).

My understanding of PCIe performance would give the following
theoretical max bandwidth based on clock and encoding:
pcie gen1 x1 : 2500MT/s*1lane*80% (8B/10B encoding) = 2000Mbps = 250MB/s
pcie gen2 x1 : 5000MT/s*1lane*80% (8B/10B encoding) = 4000Mbps = 500MB/s
pcie gen3 x1 : 8000MT/s*1lane*98.46% (128B/130B encoding) = 7877Mbps = 985MB/s
pcie gen3 x4 : 8000MT/s*4lane*98.46% (128B/130B encoding) = 31508Mbps = 3938MB/s

My assumption is an NVMe would have very little data overhead and thus
be a simple way to test PCIe bus performance.
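
For reference, the speed and width the link actually trained at can be
checked with lspci (the 01:00.0 address below is just an example and
will differ per system):

  sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'

LnkCap is what the device advertises; LnkSta is what was negotiated.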

Testing this NVMe with 'dd if=/dev/nvme0n1 of=/dev/null bs=1M
count=500 iflag=nocache' on various systems gives me the following:
- x86 gen3 x4: 2700MB/s (vs theoretical max of ~4GB/s)
- x86 gen3 x1: 840MB/s
- x86 gen2 x1: 390MB/s
- cn8030 gen3 x1: 352MB/s (Cavium OcteonTX)
- cn8030 gen2 x1: 193MB/s (Cavium OcteonTX)
- imx8mm gen2 x1: 266MB/s

The various x86 tests were not all done on the same PC or the same
kernel or kernel config... I used what I had around with whatever
Linux OS was on them just to get a feel for performance. In all cases
but the x4 case, lanes 2/3/4 were masked off with kapton tape to
force a 1-lane link.

Why do you think the IMX8MM running at gen2 x1 would have such
lower-than-expected performance (266MB/s vs the 390MB/s an x86 gen2 x1
could get)?

What would a more appropriate way of testing PCIe performance be?

Best regards,

Tim

* Re: IMX8MM PCIe performance evaluated with NVMe
  2021-12-03 21:52 IMX8MM PCIe performance evaluated with NVMe Tim Harvey
@ 2021-12-03 22:18 ` Krzysztof Wilczyński
  2021-12-03 23:31 ` Keith Busch
  1 sibling, 0 replies; 5+ messages in thread
From: Krzysztof Wilczyński @ 2021-12-03 22:18 UTC (permalink / raw)
  To: Tim Harvey
  Cc: Jingoo Han, Gustavo Pimentel, Rob Herring, Lorenzo Pieralisi,
	Bjorn Helgaas, linux-pci, Richard Zhu, Jens Axboe

[+CC Jens as he is the block, I/O scheduler, NVMe, etc., maintainer]

Hi Tim,

[...]
> What would a more appropriate way of testing PCIe performance be?

I am adding Jens here for visibility as he does a lot of storage and I/O
performance testing on various platforms and with various hardware. He
also wrote fio[1], which I would recommend for testing over dd, and he
is an NVMe driver maintainer.  If he has a moment, then perhaps he could
give us some tips too.

1. https://github.com/axboe/fio

	Krzysztof

* Re: IMX8MM PCIe performance evaluated with NVMe
  2021-12-03 21:52 IMX8MM PCIe performance evaluated with NVMe Tim Harvey
  2021-12-03 22:18 ` Krzysztof Wilczyński
@ 2021-12-03 23:31 ` Keith Busch
  2021-12-15 16:26   ` Tim Harvey
  1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2021-12-03 23:31 UTC (permalink / raw)
  To: Tim Harvey
  Cc: Jingoo Han, Gustavo Pimentel, Rob Herring, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Bjorn Helgaas, linux-pci, Richard Zhu

On Fri, Dec 03, 2021 at 01:52:17PM -0800, Tim Harvey wrote:
> Greetings,
> 
> I'm using PCIe on the IMX8M Mini and testing PCIe performance with an
> NVMe drive constrained to 1 lane. The drive in question is a Samsung
> SSD980 500GB which claims 3500MB/s read speed (with a gen3 x4 link).
> 
> My understanding of PCIe performance would give the following
> theoretical max bandwidth based on clock and encoding:
> pcie gen1 x1 : 2500MT/s*1lane*80% (8B/10B encoding) = 2000Mbps = 250MB/s
> pcie gen2 x1 : 5000MT/s*1lane*80% (8B/10B encoding) = 4000Mbps = 500MB/s
> pcie gen3 x1 : 8000MT/s*1lane*98.46% (128B/130B encoding) = 7877Mbps = 985MB/s
> pcie gen3 x4 : 8000MT/s*4lane*98.46% (128B/130B encoding) = 31508Mbps = 3938MB/s
> 
> My assumption is an NVMe would have very little data overhead and thus
> be a simple way to test PCIe bus performance.

Your 'dd' output is only reporting the user data throughput, but there
is more happening on the link than just user data.

You've accounted for the bit encoding, but there's more from the PCIe
protocol: the PHY layer (SOS), DLLP (Ack, FC), and TLP (headers,
sequences, checksums). 

NVMe itself also adds some overhead in the form of SQE, CQE, PRP, and
MSIx.

All told, the best theoretical bandwidth that user data will be able to
utilize out of the link is going to end up being ~85-90%, depending on
your PCIe MPS (Max Payload Size) setting.
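
As a rough back-of-the-envelope example (numbers are approximate): with
MPS=256, each data TLP carries 256 bytes of payload against roughly
20-30 bytes of header/sequence/CRC/framing overhead, so per-TLP
efficiency is about 256/(256+26) ~= 91%; with MPS=128 it drops to
around 83%, before counting SOS and DLLP traffic. The negotiated MPS
can be checked with lspci (the 01:00.0 address is just an example;
look at the MaxPayload value under DevCtl):

  sudo lspci -s 01:00.0 -vv | grep MaxPayload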
 
> Testing this NVMe with 'dd if=/dev/nvme0n1 of=/dev/null bs=1M
> count=500 iflag=nocache' on various systems gives me the following:

If using 'dd', I think you want to use 'iflag=direct' rather than 'nocache'.
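
That is, something along the lines of:

  dd if=/dev/nvme0n1 of=/dev/null bs=1M count=500 iflag=direct

O_DIRECT bypasses the page cache entirely, so the read path goes
straight to the device.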

> - x86 gen3 x4: 2700MB/s (vs theoretical max of ~4GB/s)
> - x86 gen3 x1: 840MB/s
> - x86 gen2 x1: 390MB/s
> - cn8030 gen3 x1: 352MB/s (Cavium OcteonTX)
> - cn8030 gen2 x1: 193MB/s (Cavium OcteonTX)
> - imx8mm gen2 x1: 266MB/s
> 
> The various x86 tests were not all done on the same PC or the same
> kernel or kernel config... I used what I had around with whatever
> Linux OS was on them just to get a feel for performance. In all cases
> but the x4 case, lanes 2/3/4 were masked off with kapton tape to
> force a 1-lane link.
> 
> Why do you think the IMX8MM running at gen2 x1 would have such
> lower-than-expected performance (266MB/s vs the 390MB/s an x86 gen2
> x1 could get)?
> 
> What would a more appropriate way of testing PCIe performance be?

Beyond the protocol overhead, 'dd' is probably not going to be the best
way to measure a device's performance. This sends just one command at a
time, so you are also measuring the full software stack latency, which
includes a system call and interrupt driven context switches. The PCIe
traffic would be idle during this overhead when running at just qd1.

I am guessing your x86 is simply faster at executing through this
software stack than your imx8mm, so the software latency is lower.

A better approach may be to use higher queue depths with batched
submissions so that your software overhead can occur concurrently with
your PCIe traffic. Also, you can eliminate interrupt context switches if
you use polled IO queues.
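
For example, an fio run along these lines (device name is just an
example) keeps 32 reads in flight instead of one:

  fio --name=qd32-read --filename=/dev/nvme0n1 --rw=read --bs=1M \
      --direct=1 --ioengine=libaio --iodepth=32

That should keep the link busy while the CPU is preparing the next
batch of commands.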

* Re: IMX8MM PCIe performance evaluated with NVMe
  2021-12-03 23:31 ` Keith Busch
@ 2021-12-15 16:26   ` Tim Harvey
  2021-12-15 16:51     ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Tim Harvey @ 2021-12-15 16:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jingoo Han, Gustavo Pimentel, Rob Herring, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Bjorn Helgaas, linux-pci, Richard Zhu,
	Barry Long

On Fri, Dec 3, 2021 at 3:31 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Dec 03, 2021 at 01:52:17PM -0800, Tim Harvey wrote:
> > Greetings,
> >
> > I'm using PCIe on the IMX8M Mini and testing PCIe performance with an
> > NVMe drive constrained to 1 lane. The drive in question is a Samsung
> > SSD980 500GB which claims 3500MB/s read speed (with a gen3 x4 link).
> >
> > My understanding of PCIe performance would give the following
> > theoretical max bandwidth based on clock and encoding:
> > pcie gen1 x1 : 2500MT/s*1lane*80% (8B/10B encoding) = 2000Mbps = 250MB/s
> > pcie gen2 x1 : 5000MT/s*1lane*80% (8B/10B encoding) = 4000Mbps = 500MB/s
> > pcie gen3 x1 : 8000MT/s*1lane*98.46% (128B/130B encoding) = 7877Mbps = 985MB/s
> > pcie gen3 x4 : 8000MT/s*4lane*98.46% (128B/130B encoding) = 31508Mbps = 3938MB/s
> >
> > My assumption is an NVMe would have very little data overhead and thus
> > be a simple way to test PCIe bus performance.
>
> Your 'dd' output is only reporting the user data throughput, but there
> is more happening on the link than just user data.
>
> You've accounted for the bit encoding, but there's more from the PCIe
> protocol: the PHY layer (SOS), DLLP (Ack, FC), and TLP (headers,
> sequences, checksums).
>
> NVMe itself also adds some overhead in the form of SQE, CQE, PRP, and
> MSIx.
>
> All told, the best theoretical bandwidth that user data will be able to
> utilize out of the link is going to end up being ~85-90%, depending on
> your PCIe MPS (Max Payload Size) setting.
>
> > Testing this NVMe with 'dd if=/dev/nvme0n1 of=/dev/null bs=1M
> > count=500 iflag=nocache' on various systems gives me the following:
>
> If using 'dd', I think you want to use 'iflag=direct' rather than 'nocache'.
>
> > - x86 gen3 x4: 2700MB/s (vs theoretical max of ~4GB/s)
> > - x86 gen3 x1: 840MB/s
> > - x86 gen2 x1: 390MB/s
> > - cn8030 gen3 x1: 352MB/s (Cavium OcteonTX)
> > - cn8030 gen2 x1: 193MB/s (Cavium OcteonTX)
> > - imx8mm gen2 x1: 266MB/s
> >
> > The various x86 tests were not all done on the same PC or the same
> > kernel or kernel config... I used what I had around with whatever
> > Linux OS was on them just to get a feel for performance. In all cases
> > but the x4 case, lanes 2/3/4 were masked off with kapton tape to
> > force a 1-lane link.
> >
> > Why do you think the IMX8MM running at gen2 x1 would have such
> > lower-than-expected performance (266MB/s vs the 390MB/s an x86 gen2
> > x1 could get)?
> >
> > What would a more appropriate way of testing PCIe performance be?
>
> Beyond the protocol overhead, 'dd' is probably not going to be the best
> way to measure a device's performance. This sends just one command at a
> time, so you are also measuring the full software stack latency, which
> includes a system call and interrupt driven context switches. The PCIe
> traffic would be idle during this overhead when running at just qd1.
>
> I am guessing your x86 is simply faster at executing through this
> software stack than your imx8mm, so the software latency is lower.
>
> A better approach may be to use higher queue depths with batched
> submissions so that your software overhead can occur concurrently with
> your PCIe traffic. Also, you can eliminate interrupt context switches if
> you use polled IO queues.

Thanks for the response!

The roughly 266MB/s performance result I've got on IMX8MM gen2 x1
using NVMe and plain old 'dd' is on par with what someone else has
found using a custom PCIe device of theirs and a simple loopback
test, so I feel that the 'software stack' isn't the bottleneck here
(as it's removed in their situation). I'm leaning towards something
like interrupt latency. I'll have to dig into the NVMe device driver
and see if there is a way to hack it to poll, to see what the
difference is.

Best regards,

Tim

* Re: IMX8MM PCIe performance evaluated with NVMe
  2021-12-15 16:26   ` Tim Harvey
@ 2021-12-15 16:51     ` Keith Busch
  0 siblings, 0 replies; 5+ messages in thread
From: Keith Busch @ 2021-12-15 16:51 UTC (permalink / raw)
  To: Tim Harvey
  Cc: Jingoo Han, Gustavo Pimentel, Rob Herring, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Bjorn Helgaas, linux-pci, Richard Zhu,
	Barry Long

On Wed, Dec 15, 2021 at 08:26:37AM -0800, Tim Harvey wrote:
> On Fri, Dec 3, 2021 at 3:31 PM Keith Busch <kbusch@kernel.org> wrote:
> > On Fri, Dec 03, 2021 at 01:52:17PM -0800, Tim Harvey wrote:
> > > What would a more appropriate way of testing PCIe performance be?
> >
> > Beyond the protocol overhead, 'dd' is probably not going to be the best
> > way to measure a device's performance. This sends just one command at a
> > time, so you are also measuring the full software stack latency, which
> > includes a system call and interrupt driven context switches. The PCIe
> > traffic would be idle during this overhead when running at just qd1.
> >
> > I am guessing your x86 is simply faster at executing through this
> > software stack than your imx8mm, so the software latency is lower.
> >
> > A better approach may be to use higher queue depths with batched
> > submissions so that your software overhead can occur concurrently with
> > your PCIe traffic. Also, you can eliminate interrupt context switches if
> > you use polled IO queues.
> 
> Thanks for the response!
> 
> The roughly 266MB/s performance results I've got on IMX8MM gen2 x1
> using NVMe and plain old 'dd' is on par with what another has found
> using a custom PCIe device of theirs and a simple loopback test so I
> feel that the 'software stack' isn't the bottleneck here (as that's
> removed in his situation). I'm leaning towards something like
> interrupt latency. I'll have to dig into the NVMe device driver and
> see if there is a way to hack it to poll to see what the difference
> is.

You don't need to hack anything; the driver already supports polling.
You just need to enable the poll queues (they're off by default). For
example, you can turn on 2 polled queues with the kernel parameter:

  nvme.poll_queues=2

After booting with that, you just need to submit IO with the HIPRI flag.
The 'dd' command can't do that, so I think you'll need to use 'fio'. An
example command that will run the same workload as your 'dd' example,
but with polling:

  fio --name=global --filename=/dev/nvme1n1 --rw=read --ioengine=pvsync2 --bs=1M --direct=1 --hipri --name=test

To verify that polling is actually happening, the fio output for "cpu"
stats should show something like "sys=99%".
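
If you want the setting to persist without editing the kernel command
line, and nvme is built as a module, the usual modprobe.d mechanism
should also work (file name is just an example):

  echo "options nvme poll_queues=2" > /etc/modprobe.d/nvme-poll.conf

The driver also logs the queue split at probe time, so something like
'dmesg | grep -i "poll queues"' should confirm the poll queues were
actually allocated.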
