* bad IOPS when running multiple btest/fio in parallel
@ 2018-10-10 21:52 Yao Lin
  2018-10-15  7:55 ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: Yao Lin @ 2018-10-10 21:52 UTC (permalink / raw)


Host: Ubuntu 18.04 (4.15 kernel), Core i9-7940X (14C/28T) with 32 GB DRAM, and a single-port 100G rNIC. No OFED driver is installed.

1.	When I insert 4 Intel Optane 905P SSDs into the host and run 4 btest instances in parallel (one per Optane; random read, bs=4K, 6 threads, QD=32), I get an aggregate of 2380K IOPS.
2.	Then I move those 4 Optane drives into 4 NVMeOF targets (RoCEv2). Each target has a 25G rNIC; all four 25G rNICs and the host's 100G rNIC are connected to the same switch.
3.	Starting iperf from all 4 targets toward the host gives an aggregate throughput of 92 Gbps, so the data path between the host and the targets is clean.
4.	From the host, use "nvme connect" to link up with all 4 targets (example connect commands are sketched after this list).
5.	Run btest against each target separately (non-overlapping); IOPS is around 595K each, which is good.
6.	Run 4 btest instances in parallel (one per target). This is basically the same as #1, except it now goes over the fabric, but the aggregate IOPS is only 1500K. Assigning CPU affinity so that each btest gets an exclusive 3C/6T doesn't help, and replacing btest with fio doesn't help either.
7.	Replace the 100G rNIC with a model from a different vendor and repeat test #6. The aggregate IOPS is better, but still nowhere close to the expected 2380K.
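
For reference, the connect commands are along these lines (the addresses and NQNs here are placeholders, not the exact values I used):

	# one "nvme connect" per target, all over RDMA (RoCEv2)
	nvme connect -t rdma -a 192.168.1.11 -s 4420 -n nqn.2018-10.test:optane1
	nvme connect -t rdma -a 192.168.1.12 -s 4420 -n nqn.2018-10.test:optane2
	nvme connect -t rdma -a 192.168.1.13 -s 4420 -n nqn.2018-10.test:optane3
	nvme connect -t rdma -a 192.168.1.14 -s 4420 -n nqn.2018-10.test:optane4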

So I am wondering whether there is any known limitation in the Linux inbox NVMeOF driver with regard to running multiple sessions in parallel. Is there any tuning I should try?

Thanks,
Yao


* bad IOPS when running multiple btest/fio in parallel
  2018-10-10 21:52 bad IOPS when running multiple btest/fio in parallel Yao Lin
@ 2018-10-15  7:55 ` Sagi Grimberg
  0 siblings, 0 replies; 5+ messages in thread
From: Sagi Grimberg @ 2018-10-15  7:55 UTC (permalink / raw)



> Host: Ubuntu 18.04 (4.15 kernel), Core i9-7940X (14C/28T) with 32 GB DRAM, and a single-port 100G rNIC. No OFED driver is installed.
> 
> 1.	When I insert 4 Intel Optane 905P SSDs into the host and run 4 btest instances in parallel (one per Optane; random read, bs=4K, 6 threads, QD=32), I get an aggregate of 2380K IOPS.
> 2.	Then I move those 4 Optane drives into 4 NVMeOF targets (RoCEv2). Each target has a 25G rNIC; all four 25G rNICs and the host's 100G rNIC are connected to the same switch.
> 3.	Starting iperf from all 4 targets toward the host gives an aggregate throughput of 92 Gbps, so the data path between the host and the targets is clean.
> 4.	From the host, use "nvme connect" to link up with all 4 targets.
> 5.	Run btest against each target separately (non-overlapping); IOPS is around 595K each, which is good.
> 6.	Run 4 btest instances in parallel (one per target). This is basically the same as #1, except it now goes over the fabric, but the aggregate IOPS is only 1500K. Assigning CPU affinity so that each btest gets an exclusive 3C/6T doesn't help, and replacing btest with fio doesn't help either.
> 7.	Replace the 100G rNIC with a model from a different vendor and repeat test #6. The aggregate IOPS is better, but still nowhere close to the expected 2380K.
> 
> So I am wondering whether there is any known limitation in the Linux inbox NVMeOF driver with regard to running multiple sessions in parallel. Is there any tuning I should try?

Does setting modparam register_always=Y make a difference?
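
Something like this should flip it (a sketch, assuming nvme_rdma can be reloaded, i.e. no active NVMeoF connections):

	# register_always is an nvme_rdma module parameter (bool)
	modprobe -r nvme_rdma
	modprobe nvme_rdma register_always=Y
	# verify the value now in effect
	cat /sys/module/nvme_rdma/parameters/register_always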


* bad IOPS when running multiple btest/fio in parallel
  2018-10-12  4:44 Yao Lin
  2018-10-12 14:39 ` Keith Busch
@ 2018-10-12 15:49 ` Bart Van Assche
  1 sibling, 0 replies; 5+ messages in thread
From: Bart Van Assche @ 2018-10-12 15:49 UTC (permalink / raw)


On Fri, 2018-10-12 at 04:44 +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> I directly connect 2 PCs (identical hardware) with a pair of 100G
> rNICs, create a null block device on the target PC, and configure it
> as the NVMeOF target. So there is no switch or SSD in this setup, and
> this is a single fio instance, not the 4 fio instances in parallel I
> mentioned earlier.
> 
> Running the fio test against that null block device from the host,
> the best IOPS is 1550K. That's the best I got after trying many
> different QD, number-of-jobs, and CPU affinity settings. Running the
> same fio test locally on the target, I get 2250K IOPS (it jumps to
> 3650K when I increase the number of threads).
> 
> So it seems to me that the Linux NVMe stack is quite good and can
> sustain 100 Gb/s+ throughput, but the same cannot be said of the
> NVMeOF stack. Is any tuning possible?

Many high-speed network adapters need multiple connections between
initiator and target to achieve line rate (typically 2-4 connections).
From the NVMeOF initiator driver:

		/* one blk-mq hardware queue per I/O queue; the admin queue is not counted */
		set->nr_hw_queues = nctrl->queue_count - 1;

I think the "queue_count" parameter can be configured when creating a
connection. From the drivers/nvme/host/fabrics.c source file:

static const match_table_t opt_tokens = {
	[ ... ]
	{ NVMF_OPT_NR_IO_QUEUES,	"nr_io_queues=%d"	},
	[ ... ]
};

Have you tried modifying the nr_io_queues parameter? Have you verified
whether the 100G NICs you are using allocate multiple MSI-X vectors and
whether each vector has been assigned to a different CPU?
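
For example (values and IRQ names are illustrative only):

	# connect with an explicit number of I/O queues (nvme-cli)
	nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2018-10.test:nullb \
		--nr-io-queues=16

	# check how many MSI-X vectors the NIC got and where they are pinned
	grep <nic_driver> /proc/interrupts        # e.g. mlx5 for a Mellanox NIC
	cat /proc/irq/<irq_number>/smp_affinity_list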

Bart.


* bad IOPS when running multiple btest/fio in parallel
  2018-10-12  4:44 Yao Lin
@ 2018-10-12 14:39 ` Keith Busch
  2018-10-12 15:49 ` Bart Van Assche
  1 sibling, 0 replies; 5+ messages in thread
From: Keith Busch @ 2018-10-12 14:39 UTC (permalink / raw)


On Fri, Oct 12, 2018 at 04:44:22AM +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> I directly connect 2 PCs (identical hardware) with a pair of 100G rNICs, create a null block device on the target PC, and configure it as the NVMeOF target. So there is no switch or SSD in this setup, and this is a single fio instance, not the 4 fio instances in parallel I mentioned earlier.
> 
> Running the fio test against that null block device from the host, the best IOPS is 1550K. That's the best I got after trying many different QD, number-of-jobs, and CPU affinity settings. Running the same fio test locally on the target, I get 2250K IOPS (it jumps to 3650K when I increase the number of threads).
> 
> So it seems to me that the Linux NVMe stack is quite good and can sustain 100 Gb/s+ throughput, but the same cannot be said of the NVMeOF stack. Is any tuning possible?

Are you sure it's the software stack? You need to check your CPU
utilization to see if that's a possibility.
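
Something like the following, run while the fio test is going, would show whether any core is pegged (mpstat comes from the sysstat package):

	# per-CPU utilization, sampled once per second
	mpstat -P ALL 1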


* bad IOPS when running multiple btest/fio in parallel
@ 2018-10-12  4:44 Yao Lin
  2018-10-12 14:39 ` Keith Busch
  2018-10-12 15:49 ` Bart Van Assche
  0 siblings, 2 replies; 5+ messages in thread
From: Yao Lin @ 2018-10-12  4:44 UTC (permalink / raw)


Today I changed to a much simpler setup and the same issue persists.

I directly connect 2 PCs (identical hardware) with a pair of 100G rNICs, create a null block device on the target PC, and configure it as the NVMeOF target. So there is no switch or SSD in this setup, and this is a single fio instance, not the 4 fio instances in parallel I mentioned earlier.
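
The target-side setup was roughly the following configfs sequence (the NQN and IP address are placeholders):

	# export a null_blk device over RDMA via the nvmet configfs interface
	modprobe null_blk nr_devices=1
	modprobe nvmet
	modprobe nvmet-rdma
	mkdir /sys/kernel/config/nvmet/subsystems/testnqn
	echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
	mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
	echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
	echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
	mkdir /sys/kernel/config/nvmet/ports/1
	echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
	echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
	echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
	echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
	ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/testnqn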

Running the fio test against that null block device from the host, the best IOPS is 1550K. That's the best I got after trying many different QD, number-of-jobs, and CPU affinity settings. Running the same fio test locally on the target, I get 2250K IOPS (it jumps to 3650K when I increase the number of threads).
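
On the host side, the fio job was roughly this (device name and job count are illustrative):

	# 4K random read against the connected NVMeoF block device
	fio --name=nullb-randread --filename=/dev/nvme1n1 --rw=randread --bs=4k \
		--ioengine=libaio --direct=1 --iodepth=32 --numjobs=8 \
		--group_reporting --time_based --runtime=60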

So it seems to me that the Linux NVMe stack is quite good and can sustain 100 Gb/s+ throughput, but the same cannot be said of the NVMeOF stack. Is any tuning possible?

