* Re: [SPDK] io size limitation on spdk_blob_io_write()
@ 2019-03-21  7:40 Szwed, Maciej
  0 siblings, 0 replies; 5+ messages in thread
From: Szwed, Maciej @ 2019-03-21  7:40 UTC (permalink / raw)
  To: spdk


Hi Niu,
We do split I/O according to backend bdev limitations, but only if you create a bdev and use the spdk_bdev_read/write/... commands. For the blob interface there isn't any such mechanism, unfortunately.
I'm guessing that your cluster size for blobs is at least 128MB. You can try setting the cluster size to a value lower than the NVMe bdev limitation; the blobstore layer will then always split I/O at cluster boundaries.
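
As a rough illustration, here is a minimal sketch of picking a smaller cluster size when the blobstore is initialized (the 16MB value and the helper function are assumptions; note that newer SPDK versions also pass sizeof(opts) to spdk_bs_opts_init()):

    #include "spdk/blob.h"
    #include "spdk/blob_bdev.h"

    static void
    init_bs_with_small_clusters(struct spdk_bdev *bdev,
                                spdk_bs_op_with_handle_complete cb_fn, void *cb_arg)
    {
        struct spdk_bs_dev *bs_dev;
        struct spdk_bs_opts opts;

        /* Wrap the NVMe bdev so blobstore can use it as its backing device. */
        bs_dev = spdk_bdev_create_bs_dev(bdev, NULL, NULL);

        spdk_bs_opts_init(&opts);
        /* Pick a cluster size below the NVMe bdev's per-I/O limit, e.g. 16MB. */
        opts.cluster_sz = 16 * 1024 * 1024;

        spdk_bs_init(bs_dev, &opts, cb_fn, cb_arg);
    }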

Regards,
Maciek

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Niu, Yawei
Sent: Thursday, March 21, 2019 2:36 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] io size limitation on spdk_blob_io_write()

Hi,

We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limit (which depends on the I/O queue depth & max transfer size).
We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering whether blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to backend bdev limitations (just like what we do for I/O that crosses a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath. Any thoughts?

Thanks
-Niu


* Re: [SPDK] io size limitation on spdk_blob_io_write()
@ 2019-03-21 16:40 Niu, Yawei
  0 siblings, 0 replies; 5+ messages in thread
From: Niu, Yawei @ 2019-03-21 16:40 UTC (permalink / raw)
  To: spdk


Hi, Jim

Thanks for the explanation, and your proposal is definitely a workable solution for us.  BTW, we have our own control plane on the server that calls the SPDK API directly, so what we need is just some exported API to configure these settings dynamically. Thanks for your help!

- Niu

On 21/03/2019, 11:48 PM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:

    Hi Niu,
    
    The NVMe driver is actually breaking this 128MB I/O into much smaller chunks.  Most SSDs only support a max I/O size of 128KB-1MB.  The problem you're seeing here is that the I/O is so big that the NVMe driver doesn't have enough context objects (struct nvme_request) to handle even a single 128MB I/O.
    
    Adding this kind of splitting for extremely big I/O inside of blobstore itself isn't planned.  Blobstore splits on cluster boundaries because it's required.  But there's a lot of complexity - especially when you have to do things like split an array of iovs.  I'm guessing that in your case you just have a single iov, which would make splitting easy, but blobstore has to handle the splitting for all cases.  We've built complex splitting into the bdev layer - I'd like to see if we can find a way for you to use that here instead.
    
    In lib/bdev/nvme/bdev_nvme.c - search for "optimal_io_boundary".  This is currently set to the optimal_io_boundary reported by the namespace - some SSDs don't report an optimal boundary, and most Intel SSDs specify 128KB here.  Try setting the optimal_io_boundary to something like 8MB.  Then also set bdev->disk.split_on_optimal_io_boundary = true.  Then when you start sending 128MB I/O from blobstore, the bdev layer will split those on 8MB boundaries.  If the NVMe driver runs out of nvme_request objects, the bdev layer will queue up the split requests and resubmit them once previous I/Os complete.
    
    If this looks like a workable solution for you, we can look at ways to configure these settings dynamically via the nvme bdev RPCs.
    
    -Jim
    
    
    
    
    On 3/21/19, 5:14 AM, "SPDK on behalf of Niu, Yawei" <spdk-bounces(a)lists.01.org on behalf of yawei.niu(a)intel.com> wrote:
    
        Thanks for the reply, Maciek.
        
        Yes, our cluster size is 1GB by default, and we have our own finer-grained block allocator inside the blob (we need a 4k block-size allocator, and I'm afraid that's not feasible for the blob allocator), so using a small cluster size isn't an option for us.
        Would you consider improving the blob I/O interface to split I/O according to backend bdev limitations (I think it's similar to the cross-cluster-boundary split)? Otherwise, we have to be aware of the bdev limitations underneath the blobstore, which doesn't look quite clean to me. What do you think?
        
        Thanks
        -Niu
        
        On 21/03/2019, 3:41 PM, "SPDK on behalf of Szwed, Maciej" <spdk-bounces(a)lists.01.org on behalf of maciej.szwed(a)intel.com> wrote:
        
            Hi Niu,
            We do split I/O according to backend bdev limitations, but only if you create a bdev and use the spdk_bdev_read/write/... commands. For the blob interface there isn't any such mechanism, unfortunately.
            I'm guessing that your cluster size for blobs is at least 128MB. You can try setting the cluster size to a value lower than the NVMe bdev limitation; the blobstore layer will then always split I/O at cluster boundaries.
            
            Regards,
            Maciek
            
            -----Original Message-----
            From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Niu, Yawei
            Sent: Thursday, March 21, 2019 2:36 AM
            To: Storage Performance Development Kit <spdk(a)lists.01.org>
            Subject: [SPDK] io size limitation on spdk_blob_io_write()
            
            Hi,
            
            We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limit (which depends on the I/O queue depth & max transfer size).
            We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering whether blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to backend bdev limitations (just like what we do for I/O that crosses a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath. Any thoughts?
            
            Thanks
            -Niu



* Re: [SPDK] io size limitation on spdk_blob_io_write()
@ 2019-03-21 15:47 Harris, James R
  0 siblings, 0 replies; 5+ messages in thread
From: Harris, James R @ 2019-03-21 15:47 UTC (permalink / raw)
  To: spdk


Hi Niu,

The NVMe driver is actually breaking this 128MB I/O into much smaller chunks.  Most SSDs only support a max I/O size of 128KB-1MB.  The problem you're seeing here is that the I/O is so big that the NVMe driver doesn't have enough context objects (struct nvme_request) to handle even a single 128MB I/O.
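(For a rough sense of scale, assuming a 128KB max transfer size: 128MB / 128KB = 1024 child requests, each needing its own struct nvme_request from a finite per-queue-pair pool.)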

Adding this kind of splitting for extremely big I/O inside of blobstore itself isn't planned.  Blobstore splits on cluster boundaries because it's required.  But there's a lot of complexity - especially when you have to do things like split an array of iovs.  I'm guessing that in your case you just have a single iov, which would make splitting easy, but blobstore has to handle the splitting for all cases.  We've built complex splitting into the bdev layer - I'd like to see if we can find a way for you to use that here instead.

In lib/bdev/nvme/bdev_nvme.c - search for "optimal_io_boundary".  This is currently set to the optimal_io_boundary reported by the namespace - some SSDs don't report an optimal boundary, and most Intel SSDs specify 128KB here.  Try setting the optimal_io_boundary to something like 8MB.  Then also set bdev->disk.split_on_optimal_io_boundary = true.  Then when you start sending 128MB I/O from blobstore, the bdev layer will split those on 8MB boundaries.  If the NVMe driver runs out of nvme_request objects, the bdev layer will queue up the split requests and resubmit them once previous I/Os complete.
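
To make that concrete, here is a rough sketch of what the tweak might look like (only the two disk.* fields come from the paragraph above; the helper function, the spdk_nvme_ns_get_* calls and the 8MB figure are assumptions, and optimal_io_boundary is expressed in blocks rather than bytes):

    #include "spdk/nvme.h"
    #include "spdk/bdev_module.h"

    static void
    nvme_disk_force_io_boundary(struct spdk_bdev *disk, struct spdk_nvme_ns *ns)
    {
        uint32_t block_size = spdk_nvme_ns_get_sector_size(ns);

        /* Current behavior: use whatever boundary the namespace reports
         * (0 if none, commonly 128KB on Intel SSDs). */
        disk->optimal_io_boundary = spdk_nvme_ns_get_optimal_io_boundary(ns);

        /* Experiment: force an 8MB boundary and let the bdev layer split on it,
         * so a 128MB blobstore write becomes sixteen 8MB children that can be
         * queued and resubmitted as nvme_request objects free up. */
        disk->optimal_io_boundary = (8u * 1024 * 1024) / block_size;
        disk->split_on_optimal_io_boundary = true;
    }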

If this looks like a workable solution for you, we can look at ways to configure these settings dynamically via the nvme bdev RPCs.

-Jim




On 3/21/19, 5:14 AM, "SPDK on behalf of Niu, Yawei" <spdk-bounces(a)lists.01.org on behalf of yawei.niu(a)intel.com> wrote:

    Thanks for the reply, Maciek.
    
    Yes, our cluster size is 1GB by default, and we have our own finer-grained block allocator inside the blob (we need a 4k block-size allocator, and I'm afraid that's not feasible for the blob allocator), so using a small cluster size isn't an option for us.
    Would you consider improving the blob I/O interface to split I/O according to backend bdev limitations (I think it's similar to the cross-cluster-boundary split)? Otherwise, we have to be aware of the bdev limitations underneath the blobstore, which doesn't look quite clean to me. What do you think?
    
    Thanks
    -Niu
    
    On 21/03/2019, 3:41 PM, "SPDK on behalf of Szwed, Maciej" <spdk-bounces(a)lists.01.org on behalf of maciej.szwed(a)intel.com> wrote:
    
        Hi Niu,
        We do split I/O according to backend bdev limitations, but only if you create a bdev and use the spdk_bdev_read/write/... commands. For the blob interface there isn't any such mechanism, unfortunately.
        I'm guessing that your cluster size for blobs is at least 128MB. You can try setting the cluster size to a value lower than the NVMe bdev limitation; the blobstore layer will then always split I/O at cluster boundaries.
        
        Regards,
        Maciek
        
        -----Original Message-----
        From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Niu, Yawei
        Sent: Thursday, March 21, 2019 2:36 AM
        To: Storage Performance Development Kit <spdk(a)lists.01.org>
        Subject: [SPDK] io size limitation on spdk_blob_io_write()
        
        Hi,
        
        We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limit (which depends on the I/O queue depth & max transfer size).
        We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering whether blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to backend bdev limitations (just like what we do for I/O that crosses a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath. Any thoughts?
        
        Thanks
        -Niu



* Re: [SPDK] io size limitation on spdk_blob_io_write()
@ 2019-03-21 12:08 Niu, Yawei
  0 siblings, 0 replies; 5+ messages in thread
From: Niu, Yawei @ 2019-03-21 12:08 UTC (permalink / raw)
  To: spdk


Thanks for the reply, Maciek.

Yes, our cluster size is 1GB by default, and we have our own finer-grained block allocator inside the blob (we need a 4k block-size allocator, and I'm afraid that's not feasible for the blob allocator), so using a small cluster size isn't an option for us.
Would you consider improving the blob I/O interface to split I/O according to backend bdev limitations (I think it's similar to the cross-cluster-boundary split)? Otherwise, we have to be aware of the bdev limitations underneath the blobstore, which doesn't look quite clean to me. What do you think?

Thanks
-Niu

On 21/03/2019, 3:41 PM, "SPDK on behalf of Szwed, Maciej" <spdk-bounces(a)lists.01.org on behalf of maciej.szwed(a)intel.com> wrote:

    Hi Niu,
    We do split I/O according to backend bdev limitations, but only if you create a bdev and use the spdk_bdev_read/write/... commands. For the blob interface there isn't any such mechanism, unfortunately.
    I'm guessing that your cluster size for blobs is at least 128MB. You can try setting the cluster size to a value lower than the NVMe bdev limitation; the blobstore layer will then always split I/O at cluster boundaries.
    
    Regards,
    Maciek
    
    -----Original Message-----
    From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Niu, Yawei
    Sent: Thursday, March 21, 2019 2:36 AM
    To: Storage Performance Development Kit <spdk(a)lists.01.org>
    Subject: [SPDK] io size limitation on spdk_blob_io_write()
    
    Hi,
    
    We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limit (which depends on the I/O queue depth & max transfer size).
    We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering whether blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to backend bdev limitations (just like what we do for I/O that crosses a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath. Any thoughts?
    
    Thanks
    -Niu



* [SPDK] io size limitation on spdk_blob_io_write()
@ 2019-03-21  1:36 Niu, Yawei
  0 siblings, 0 replies; 5+ messages in thread
From: Niu, Yawei @ 2019-03-21  1:36 UTC (permalink / raw)
  To: spdk


Hi,

We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limit (which depends on the I/O queue depth & max transfer size).
We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering whether blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to backend bdev limitations (just like what we do for I/O that crosses a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath. Any thoughts?
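
For reference, a minimal sketch of the caller-side workaround (one large logical write issued as a chain of smaller spdk_blob_io_write() calls, one chunk at a time); the context struct, chunk size and chaining scheme are assumptions, not anything blobstore provides:

    #include <stdlib.h>
    #include "spdk/blob.h"
    #include "spdk/util.h"

    struct split_write_ctx {
        struct spdk_blob *blob;
        struct spdk_io_channel *ch;
        uint8_t *next_buf;              /* payload for the next chunk */
        uint64_t next_offset;           /* blob offset of the next chunk, in io units */
        uint64_t remaining;             /* io units still to write */
        uint64_t chunk;                 /* io units per child write, e.g. 8MB worth */
        uint32_t io_unit_size;          /* bytes per io unit */
        spdk_blob_op_complete user_cb;  /* original caller's completion */
        void *user_cb_arg;
    };

    static void
    split_write_next(void *cb_arg, int bserrno)
    {
        struct split_write_ctx *ctx = cb_arg;
        uint8_t *buf;
        uint64_t off, len;

        if (bserrno != 0 || ctx->remaining == 0) {
            /* Error, or the whole logical write has completed. */
            ctx->user_cb(ctx->user_cb_arg, bserrno);
            free(ctx);
            return;
        }

        len = spdk_min(ctx->chunk, ctx->remaining);
        buf = ctx->next_buf;
        off = ctx->next_offset;
        ctx->next_buf += len * ctx->io_unit_size;
        ctx->next_offset += len;
        ctx->remaining -= len;

        /* Issue the next child write; its completion re-enters this function. */
        spdk_blob_io_write(ctx->blob, ctx->ch, buf, off, len, split_write_next, ctx);
    }

    /* Kick off the chain by "completing" a zero-length step. */
    static void
    split_blob_write(struct split_write_ctx *ctx)
    {
        split_write_next(ctx, 0);
    }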

Thanks
-Niu

