* bad IOPS when running multiple btest/fio in parallel
@ 2018-10-12  4:44 Yao Lin
  2018-10-12 14:39 ` Keith Busch
  2018-10-12 15:49 ` Bart Van Assche
  0 siblings, 2 replies; 6+ messages in thread
From: Yao Lin @ 2018-10-12  4:44 UTC (permalink / raw)


Today I changed to a much simpler setup and the same issue persists.

I directly connected 2 PCs (identical hardware) with a pair of 100G RNICs, created a null block device on the target PC, and configured it as the NVMeOF target. So there is no switch or SSD in this setup, and this is a single fio instance, not the 4 fio instances in parallel I mentioned earlier.
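Roughly, the target side of such a setup can be sketched with null_blk and the nvmet configfs interface. The IP address, port, and subsystem NQN below are placeholders, not the actual values from my setup:

```shell
# Sketch of the target-side setup: a null_blk device exported over RDMA
# via the nvmet configfs interface. Run as root on the target PC.
modprobe null_blk nr_devices=1
modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet

# Create a subsystem and expose the null device as namespace 1
mkdir subsystems/testnqn
echo 1 > subsystems/testnqn/attr_allow_any_host
mkdir subsystems/testnqn/namespaces/1
echo -n /dev/nullb0 > subsystems/testnqn/namespaces/1/device_path
echo 1 > subsystems/testnqn/namespaces/1/enable

# Create an RDMA port on the 100G NIC, then link the subsystem to it
mkdir ports/1
echo rdma        > ports/1/addr_trtype
echo ipv4        > ports/1/addr_adrfam
echo 192.168.0.2 > ports/1/addr_traddr
echo 4420        > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn ports/1/subsystems/testnqn
```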

Running fio against that null block device from the host, the best IOPS is 1550K. That's the best result after trying out many different queue depths, job counts, and CPU affinity settings. Running the same fio test locally on the target, I get 2250K IOPS (it jumps to 3650K when I increase the number of threads).
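The host-side run was along these lines; the device name, queue depth, and job count here are representative placeholders, not the exact best-case settings:

```shell
# Representative host-side fio run against the NVMeOF-attached device.
# /dev/nvme1n1, iodepth, and numjobs are placeholders to tune.
fio --name=nvmeof-randread \
    --filename=/dev/nvme1n1 \
    --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k \
    --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting
```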

So it seems to me that the Linux NVMe stack is quite good and can sustain 100Gb/s+ throughput, but the same cannot be said of the NVMeOF stack. Is any tuning possible?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* bad IOPS when running multiple btest/fio in parallel
  2018-10-12  4:44 bad IOPS when running multiple btest/fio in parallel Yao Lin
@ 2018-10-12 14:39 ` Keith Busch
  2018-10-12 15:37   ` [EXT] " Yao Lin
  2018-10-12 15:49 ` Bart Van Assche
  1 sibling, 1 reply; 6+ messages in thread
From: Keith Busch @ 2018-10-12 14:39 UTC (permalink / raw)


On Fri, Oct 12, 2018 at 04:44:22AM +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs. Create a null block device on the target PC and configure it as the NVMeOF target. So, there is no switch or SSD in this setup. And this is a single FIO, not the 4 FIO in parallel I mentioned earlier.
> 
> Start fio test against that null block device from the host, the best IOPS is 1550K. That's the best IOPS after I try out many different QD, # of job, and CPU affinity setting. Run the same fio test on the target, I get 2250K IOPS (it jumps to 3650K when I increased the number of threads). ?
> 
> So it seems to me that Linux NVMe stack is quite good and can support 100Gb/s + throughput. But the same can not be said of the NVMeOF stack. Any tuning possible?

Are you sure it's the software stack? You need to check your CPU utilization
to see whether that's a possibility.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
  2018-10-12 14:39 ` Keith Busch
@ 2018-10-12 15:37   ` Yao Lin
  0 siblings, 0 replies; 6+ messages in thread
From: Yao Lin @ 2018-10-12 15:37 UTC (permalink / raw)


I monitored the CPU usage during all these tests. I have a powerful CPU (an i9-7940X) and none of its cores ever reaches 80% load.

-----Original Message-----
From: Keith Busch [mailto:keith.busch@intel.com] 
Sent: Friday, October 12, 2018 7:39 AM
To: Yao Lin <yaolin at marvell.com>
Cc: linux-nvme at lists.infradead.org
Subject: [EXT] Re: bad IOPS when running multiple btest/fio in parallel

External Email

----------------------------------------------------------------------
On Fri, Oct 12, 2018 at 04:44:22AM +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs. Create a null block device on the target PC and configure it as the NVMeOF target. So, there is no switch or SSD in this setup. And this is a single FIO, not the 4 FIO in parallel I mentioned earlier.
> 
> Start fio test against that null block device from the host, the best 
> IOPS is 1550K. That's the best IOPS after I try out many different QD, 
> # of job, and CPU affinity setting. Run the same fio test on the 
> target, I get 2250K IOPS (it jumps to 3650K when I increased the 
> number of threads). ?
> 
> So it seems to me that Linux NVMe stack is quite good and can support 100Gb/s + throughput. But the same can not be said of the NVMeOF stack. Any tuning possible?

You're sure it's the software stack? Need to check your CPU utilization to see if that's a possibility.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* bad IOPS when running multiple btest/fio in parallel
  2018-10-12  4:44 bad IOPS when running multiple btest/fio in parallel Yao Lin
  2018-10-12 14:39 ` Keith Busch
@ 2018-10-12 15:49 ` Bart Van Assche
  2018-10-12 16:02   ` [EXT] " Yao Lin
  1 sibling, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2018-10-12 15:49 UTC (permalink / raw)


On Fri, 2018-10-12 at 04:44 +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs.
> Create a null block device on the target PC and configure it as the
> NVMeOF target. So, there is no switch or SSD in this setup. And this is
> a single FIO, not the 4 FIO in parallel I mentioned earlier.
> 
> Start fio test against that null block device from the host, the best
> IOPS is 1550K. That's the best IOPS after I try out many different QD,
> # of job, and CPU affinity setting. Run the same fio test on the target,
> I get 2250K IOPS (it jumps to 3650K when I increased the number of
> threads). 
> 
> So it seems to me that Linux NVMe stack is quite good and can support
> 100Gb/s + throughput. But the same can not be said of the NVMeOF stack.
> Any tuning possible?

Many high-speed network adapters need multiple connections between
initiator and target to achieve line rate (typically 2-4 connections).
From the NVMeOF initiator driver:

		set->nr_hw_queues = nctrl->queue_count - 1;

I think the "queue_count" parameter can be configured when creating a
connection. From the drivers/nvme/host/fabrics.c source file:

static const match_table_t opt_tokens = {
	[ ... ]
	{ NVMF_OPT_NR_IO_QUEUES,	"nr_io_queues=%d"	},
	[ ... ]
};

Have you tried to modify the nr_io_queues parameter? Have you verified
whether the 100G NICs you are using allocate multiple MSI/X vectors and
whether each vector has been assigned to another CPU?
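For example, something like this (the transport address, NQN, queue count, IRQ number, and driver name below are all placeholders):

```shell
# Hypothetical connect with an explicit I/O queue count; the address,
# NQN, and the value 28 are placeholders for your setup.
nvme connect -t rdma -a 192.168.0.2 -s 4420 -n testnqn --nr-io-queues=28

# Inspect the MSI-X vector spread: list the NIC's vectors, then check a
# vector's CPU affinity (driver name and IRQ number are examples).
grep mlx5 /proc/interrupts
cat /proc/irq/120/smp_affinity_list
```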

Bart.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
  2018-10-12 15:49 ` Bart Van Assche
@ 2018-10-12 16:02   ` Yao Lin
  2018-10-15  7:50     ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Yao Lin @ 2018-10-12 16:02 UTC (permalink / raw)


Thanks Bart. In my original post, I listed the performance from 2 different 100G NICs. I worked with the engineer for the NIC that performs better. Their driver does support a large number of IRQs, which are assigned to all 28 CPUs in a round-robin manner. But even with this design, that NIC can hit only 76Gb/s for RoCEv2 traffic.

I haven't gotten a response from the other NIC vendor yet. Their RoCEv2 throughput has never exceeded 55Gb/s. I will take a look at the source code.

-----Original Message-----
From: Bart Van Assche [mailto:bvanassche@acm.org] 
Sent: Friday, October 12, 2018 8:49 AM
To: Yao Lin ; linux-nvme at lists.infradead.org
Subject: [EXT] Re: bad IOPS when running multiple btest/fio in parallel

On Fri, 2018-10-12 at 04:44 +0000, Yao Lin wrote:
> Today I changed to a much simpler setup and the same issue persists.
> 
> Directly connect 2 PCs (identical hardware) with a pair of 100G rNICs.
> Create a null block device on the target PC and configure it as the 
> NVMeOF target. So, there is no switch or SSD in this setup. And this 
> is a single FIO, not the 4 FIO in parallel I mentioned earlier.
> 
> Start fio test against that null block device from the host, the best 
> IOPS is 1550K. That's the best IOPS after I try out many different QD, 
> # of job, and CPU affinity setting. Run the same fio test on the 
> target, I get 2250K IOPS (it jumps to 3650K when I increased the 
> number of threads).
> 
> So it seems to me that Linux NVMe stack is quite good and can support 
> 100Gb/s + throughput. But the same can not be said of the NVMeOF stack.
> Any tuning possible?

Many high-speed network adapters need multiple connections between initiator and target to achieve line rate (typically 2-4 connections).
From the NVMeOF initiator driver:

		set->nr_hw_queues = nctrl->queue_count - 1;

I think the "queue_count" parameter can be configured when creating a connection. From the drivers/nvme/host/fabrics.c source file:

static const match_table_t opt_tokens = {
	[ ... ]
	{ NVMF_OPT_NR_IO_QUEUES,	"nr_io_queues=%d"	},
	[ ... ]
};

Have you tried to modify the nr_io_queues parameter? Have you verified whether the 100G NICs you are using allocate multiple MSI/X vectors and whether each vector has been assigned to another CPU?

Bart.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [EXT] Re: bad IOPS when running multiple btest/fio in parallel
  2018-10-12 16:02   ` [EXT] " Yao Lin
@ 2018-10-15  7:50     ` Sagi Grimberg
  0 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2018-10-15  7:50 UTC (permalink / raw)



> Thanks Bart. In my original post, I list the performance from 2 different 100G NICs. I worked with the engineer for the NIC that performs better. Their driver does support large number of IRQ which are assigned to all 28 CPUs in a round-robin manner. But even with this  design, that NIC can hit only 76Gb/s for RoCEv2 traffic.
> 
> I haven't got the response from the other NIC vendor. Their RoCEv2 throughput has never exceed 55Gb/s. I will take a look at the source code.

What kernel version are you running?

Do you happen to run irq balancer?
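A quick way to check both (the service name assumes a systemd-based distro):

```shell
# Report the kernel version, and whether irqbalance is active (it can
# move the NIC's IRQs away from the CPUs fio is pinned to).
uname -r
systemctl is-active irqbalance 2>/dev/null || pgrep -a irqbalance || true
```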

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-10-15  7:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-12  4:44 bad IOPS when running multiple btest/fio in parallel Yao Lin
2018-10-12 14:39 ` Keith Busch
2018-10-12 15:37   ` [EXT] " Yao Lin
2018-10-12 15:49 ` Bart Van Assche
2018-10-12 16:02   ` [EXT] " Yao Lin
2018-10-15  7:50     ` Sagi Grimberg
