* Maximum NVMe IO command size > 1MB?
@ 2016-01-06 19:23 Xuehua Chen
  2016-01-06 19:31 ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 19:23 UTC (permalink / raw)


Hi, 

It seems to me kernel 4.3 supports NVMe IO command size > 512k after the following is added. 

blk_queue_max_segments(ns->queue,
       ((dev->max_hw_sectors << 9) / dev->page_size) + 1);
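
(As a rough worked example, assuming a 4KB device page size and max_hw_sectors = 4096, i.e. a 2MB hardware limit:

  (4096 << 9) / 4096 + 1 = 2097152 / 4096 + 1 = 513

so the queue accepts up to 513 segments, enough for 512 data pages plus one extra when the buffer is not page aligned.)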

If I run the following, 
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=1M --bs=1M --rw=read

I can see that one read with a data transfer size of 1MB is sent to the device. 

But if I increase the bs to 2M as below, I still see two 1MB commands sent out instead of one 2MB read command:
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read

Are there any other settings in the kernel that make it split a 2M command into two 1M commands? 

Thanks, 

Xuehua


* Maximum NVMe IO command size > 1MB?
  2016-01-06 19:23 Maximum NVMe IO command size > 1MB? Xuehua Chen
@ 2016-01-06 19:31 ` Keith Busch
  2016-01-06 19:51   ` Xuehua Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2016-01-06 19:31 UTC (permalink / raw)


On Wed, Jan 06, 2016 at 07:23:53PM +0000, Xuehua Chen wrote:
> It seems to me kernel 4.3 supports NVMe IO command size > 512k after the following is added. 
> 
> blk_queue_max_segments(ns->queue,
>        ((dev->max_hw_sectors << 9) / dev->page_size) + 1);
> 
> If I run the following, 
> fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=1M --bs=1M --rw=read
> 
> I can see that one read with a data transfer size of 1MB is sent to the device. 
> 
> But if I increase the bs to 2M as below, I still see two 1MB commands sent out instead of one 2MB read command:
> fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read
> 
> Are there any other settings in the kernel that make it split a 2M command into two 1M commands? 

Is the device actually capable of 2MB transfers? You can confirm with:

  # cat /sys/block/nvme0n1/queue/max_hw_sectors_kb


* Maximum NVMe IO command size > 1MB?
  2016-01-06 19:31 ` Keith Busch
@ 2016-01-06 19:51   ` Xuehua Chen
  2016-01-06 21:56     ` Xuehua Chen
  2016-01-07 11:39     ` Sagi Grimberg
  0 siblings, 2 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 19:51 UTC (permalink / raw)


The value is 2048, which seems to be 2MB.



* Maximum NVMe IO command size > 1MB?
  2016-01-06 19:51   ` Xuehua Chen
@ 2016-01-06 21:56     ` Xuehua Chen
  2016-01-06 22:54       ` Keith Busch
  2016-01-07 11:39     ` Sagi Grimberg
  1 sibling, 1 reply; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 21:56 UTC (permalink / raw)


Hi, Keith, 

I wonder whether this could be caused by BIO_MAX_PAGES being defined as 256, which with 4KB pages means 256 * 4KB = 1MB at most.
What do you think?

Xuehua


* Maximum NVMe IO command size > 1MB?
  2016-01-06 21:56     ` Xuehua Chen
@ 2016-01-06 22:54       ` Keith Busch
  2016-01-07 17:38         ` Xuehua Chen
  2016-01-10 22:16         ` Xuehua Chen
  0 siblings, 2 replies; 9+ messages in thread
From: Keith Busch @ 2016-01-06 22:54 UTC (permalink / raw)


On Wed, Jan 06, 2016 at 09:56:24PM +0000, Xuehua Chen wrote:
> Hi, Keith, 
> 
> I wonder whether this could be caused by BIO_MAX_PAGES being defined as 256, which with 4KB pages means 256 * 4KB = 1MB at most.
> What do you think?

I think you got it. You're running O_DIRECT, and in fs/direct-io.c,
dio_new_bio() allocates up to BIO_MAX_PAGES pages per bio.

I can't tell where that value came from (it looks like it has been there since
the very first git commit), but maybe you can propose raising it if you can
set BIO_MAX_PAGES higher without issue.
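
For context, the relevant part of fs/direct-io.c looks roughly like the sketch below (simplified from memory, not the exact 4.3 source):

  /* dio_new_bio(): each new bio is capped at BIO_MAX_PAGES pages */
  nr_pages = min(sdio->pages_in_io, BIO_MAX_PAGES);   /* BIO_MAX_PAGES = 256 */
  dio_bio_alloc(dio, sdio, bdev, sector, nr_pages);

  /* with 4KB pages: 256 * 4KB = 1MB per bio, so a 2MB O_DIRECT read is
   * built as two bios and reaches the driver as two commands */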


* Maximum NVMe IO command size > 1MB?
  2016-01-06 19:51   ` Xuehua Chen
  2016-01-06 21:56     ` Xuehua Chen
@ 2016-01-07 11:39     ` Sagi Grimberg
  2016-01-07 17:34       ` Xuehua Chen
  1 sibling, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2016-01-07 11:39 UTC (permalink / raw)



> The value is 2048, which seems to be 2MB.

I think 2048 is 1MB...

max_transfer_size = max_sectors * sector_size =
2048 * 512 = 1MB

I don't think it's coming from any other limitation in
the block layer since I'm able to transfer 8M and more in
a single request with iser.


* Maximum NVMe IO command size > 1MB?
  2016-01-07 11:39     ` Sagi Grimberg
@ 2016-01-07 17:34       ` Xuehua Chen
  0 siblings, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-07 17:34 UTC (permalink / raw)


> I think 2048 is 1MB...

Could you double check?
From https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt:

max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
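
Going by that description the unit is kilobytes, so (assuming 512-byte logical blocks):

  2048 KB = 2048 * 1024 B = 2 MB = 4096 sectors * 512 B

which would put the hardware limit at 2MB rather than 1MB.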

> I don't think it's coming from any other limitation in the block layer since I'm able to transfer 8M and more in a single request with iser.

Is this transfer done via a single bio? How did you determine the single transfer size?
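
(One way to check the size of each command actually dispatched to the device is blktrace, e.g.

  # blktrace -d /dev/nvme0n1 -o - | blkparse -i -

which prints the sector count of every request; the device name here is just the one from this thread.)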


* Maximum NVMe IO command size > 1MB?
  2016-01-06 22:54       ` Keith Busch
@ 2016-01-07 17:38         ` Xuehua Chen
  2016-01-10 22:16         ` Xuehua Chen
  1 sibling, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-07 17:38 UTC (permalink / raw)


Thanks, I will try raising it and see how it goes.

* Maximum NVMe IO command size > 1MB?
  2016-01-06 22:54       ` Keith Busch
  2016-01-07 17:38         ` Xuehua Chen
@ 2016-01-10 22:16         ` Xuehua Chen
  1 sibling, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-10 22:16 UTC (permalink / raw)


Yes, dio_new_bio() caused the splitting.

I tried raising BIO_MAX_PAGES to 512 and ran the command below again:
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read

This time one 1280K command and one 768K command are sent instead of two 1M commands, so the new BIO_MAX_PAGES
takes effect, but another factor still causes the command to split. The splitting seems to be caused by the value of
/sys/block/nvme0n1/queue/max_sectors_kb, which is 1280. After changing its value to 2048, a single 2M command is sent.
I also tried increasing iodepth to 512 and size to 1G and running it multiple times; it works well.
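
For reference, the runtime part can be reproduced with sysfs alone (a sketch assuming the same device name and the 2048KB hardware limit seen earlier; the BIO_MAX_PAGES change still needs a rebuilt kernel):

  cat /sys/block/nvme0n1/queue/max_hw_sectors_kb    # hardware limit, 2048 here
  cat /sys/block/nvme0n1/queue/max_sectors_kb       # block layer cap, 1280 by default
  echo 2048 > /sys/block/nvme0n1/queue/max_sectors_kb
  fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read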

Below is the description of max_sectors_kb in queue-sysfs.txt

max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.

It seems that BIO_MAX_PAGES and max_sectors_kb are two more factors that limit the maximum size of a transfer. 

One thing that caught my attention is that max_sectors_kb is determined by BLK_DEF_MAX_SECTORS, which is defined as
2560 in blkdev.h. That default does not accurately reflect the maximum size of a single transfer, which is 1024KB for
kernel 4.3 given the current BIO_MAX_PAGES value of 256.
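
Spelling out the arithmetic behind those numbers (assuming 512B sectors and 4KB pages):

  BLK_DEF_MAX_SECTORS = 2560 sectors * 512B = 1280KB    (the default max_sectors_kb observed above)
  BIO_MAX_PAGES * PAGE_SIZE = 256 pages * 4KB = 1024KB  (the largest single bio in 4.3)

so the advertised default is larger than what a single bio can actually carry.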

Based on the findings, I would propose the changes below.

1. Change BLK_DEF_MAX_SECTORS from 2560 to BIO_MAX_SECTORS (2048).
2. Currently users can set max_sectors_kb to any value smaller than or equal to max_hw_sectors_kb. Change the
behavior so that users cannot set it to any value bigger than the minimum limit determined by both
max_hw_sectors_kb and BIO_MAX_SECTORS (a rough sketch of this follows below).
3. Update queue-sysfs.txt for the max_sectors_kb item to also mention the limit imposed by BIO_MAX_SECTORS.
4. Possibly add a configuration option for the kernel to support BIO sizes of 2MB or more.
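
As a rough illustration of item 2, the sysfs store hook could clamp against both limits; this is only a sketch against 4.3-era block/blk-sysfs.c, not a tested patch, and the bio_max_kb variable is made up for clarity:

  static ssize_t
  queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
  {
          unsigned long max_sectors_kb;
          unsigned long max_hw_sectors_kb = queue_max_hw_sectors(q) >> 1;
          unsigned long bio_max_kb = (BIO_MAX_PAGES * PAGE_SIZE) >> 10;
          unsigned long page_kb = PAGE_SIZE >> 10;
          ssize_t ret = queue_var_store(&max_sectors_kb, page, count);

          if (ret < 0)
                  return ret;

          /* clamp by the hardware limit AND the largest bio the kernel can build */
          if (max_sectors_kb > min(max_hw_sectors_kb, bio_max_kb) ||
              max_sectors_kb < page_kb)
                  return -EINVAL;

          spin_lock_irq(q->queue_lock);
          q->limits.max_sectors = max_sectors_kb << 1;
          spin_unlock_irq(q->queue_lock);

          return ret;
  }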

Any comments?

