Hi Niu,

The NVMe driver is actually breaking this 128MB I/O into much smaller chunks - most SSDs only support a max I/O size of 128KB-1MB. The problem you're seeing here is that the I/O is so big that the NVMe driver doesn't have enough context objects (struct nvme_request) to handle even a single 128MB I/O.

Adding this kind of splitting for extremely big I/O inside of blobstore itself isn't planned. Blobstore splits on cluster boundaries because it's required, but there's a lot of complexity - especially when you have to do things like split an array of iovs. I'm guessing that in your case you just have a single iov, which would make splitting easy, but blobstore has to handle the splitting for all cases.

We've built complex splitting into the bdev layer - I'd like to see if we can find a way for you to use that here instead. In lib/bdev/nvme/bdev_nvme.c, search for "optimal_io_boundary". This is currently set to the optimal_io_boundary for the namespace - some SSDs don't report an optimal boundary; most Intel SSDs specify 128KB here. Try setting the optimal_io_boundary to something like 8MB, and also set bdev->disk.split_on_optimal_io_boundary = true. Then when you start sending 128MB I/O from blobstore, the bdev layer will split those on 8MB boundaries. If the NVMe driver runs out of nvme_request objects, the bdev layer will queue up those split requests and resubmit them once previous I/O have completed.

If this looks like a workable solution for you, we can look at ways to configure these settings dynamically via the nvme bdev RPCs.

-Jim
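A minimal sketch of the change Jim describes, assuming it is applied at the spot in lib/bdev/nvme/bdev_nvme.c where the disk fields are filled in for each namespace (the surrounding function and the local variable names bdev and ns are assumptions that vary by SPDK version; 8MB is just the example value from the mail above):

    /* optimal_io_boundary is expressed in blocks, so convert the example
     * 8MB boundary using the namespace sector size. */
    uint32_t sector_size = spdk_nvme_ns_get_sector_size(ns);

    bdev->disk.optimal_io_boundary = (8 * 1024 * 1024) / sector_size;

    /* Have the generic bdev layer split any I/O crossing that boundary;
     * split children that can't get an nvme_request are queued by the
     * bdev layer and resubmitted as earlier I/O complete. */
    bdev->disk.split_on_optimal_io_boundary = true;

These are also the two fields that the dynamic configuration via the nvme bdev RPCs mentioned above would presumably end up setting.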
On 3/21/19, 5:14 AM, "SPDK on behalf of Niu, Yawei" wrote:

    Thanks for the reply, Maciek.

    Yes, our cluster size is 1GB by default, and we have our own finer-grained block allocator inside of the blob (we need a 4KB block allocator; I'm afraid that's not feasible with the blob allocator), so using a small cluster size isn't an option for us.

    Could the blob I/O interface be improved to split I/O according to the backend bdev limitations (I think it's similar to the cross-cluster-boundary split)? Otherwise, we have to be aware of the bdev limitations underneath the blobstore, which doesn't look quite clean to me. What do you think?

    Thanks
    -Niu

    On 21/03/2019, 3:41 PM, "SPDK on behalf of Szwed, Maciej" wrote:

        Hi Niu,

        We do split I/O according to the backend bdev limitations, but only if you create a bdev and use the spdk_bdev_read/write/... commands. For the blob interface there isn't any mechanism for that, unfortunately. I'm guessing that you are using a cluster size for blobs of at least 128MB. You can try to set the cluster size to a value lower than the NVMe bdev limitation, and the blobstore layer will then always split I/O at cluster-size granularity.

        Regards,
        Maciek

        -----Original Message-----
        From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Niu, Yawei
        Sent: Thursday, March 21, 2019 2:36 AM
        To: Storage Performance Development Kit
        Subject: [SPDK] io size limitation on spdk_blob_io_write()

        Hi,

        We discovered that spdk_blob_io_write() will fail with a large I/O size (128MB) over an NVMe bdev. I checked the SPDK code a bit, and it seems the failure is because the size exceeded the NVMe bdev I/O request size limitation (which depends on the I/O queue depth and max transfer size). We could work around the problem by splitting the I/O into several spdk_blob_io_write() calls, but I was wondering if blobstore should hide these bdev details/limitations from the blobstore caller and split the I/O according to the backend bdev limitations (just like what we do for I/O crossing a cluster boundary), so that the blobstore caller doesn't need to differentiate what type of bdev is underneath? Any thoughts?

        Thanks
        -Niu
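For completeness, a rough sketch of the caller-side workaround Niu mentions: splitting one large write into several chained spdk_blob_io_write() calls, where each completion callback issues the next chunk. Apart from the spdk_blob_io_write() signature, everything here (the context struct, the chunk size, and the helper names) is hypothetical and only illustrates the idea:

#include "spdk/stdinc.h"
#include "spdk/blob.h"
#include "spdk/util.h"

/* Chunk size in blobstore io units; pick something the NVMe bdev can take
 * in a single request, e.g. 2048 x 4KB io units = 8MB. */
#define CHUNK_IO_UNITS 2048

struct split_write_ctx {
	struct spdk_blob	*blob;
	struct spdk_io_channel	*channel;
	uint8_t			*payload;	/* next chunk of the user buffer */
	uint64_t		offset;		/* next blob offset, in io units */
	uint64_t		remaining;	/* io units still to write */
	uint32_t		io_unit_size;	/* bytes per blobstore io unit */
	spdk_blob_op_complete	cb_fn;		/* caller's completion */
	void			*cb_arg;
};

static void
split_write_next(void *arg, int bserrno)
{
	struct split_write_ctx *ctx = arg;
	uint64_t off, len;
	void *buf;

	if (bserrno != 0 || ctx->remaining == 0) {
		/* Error, or the last chunk just completed. */
		ctx->cb_fn(ctx->cb_arg, bserrno);
		free(ctx);
		return;
	}

	len = spdk_min(ctx->remaining, CHUNK_IO_UNITS);
	off = ctx->offset;
	buf = ctx->payload;

	/* Advance the bookkeeping before issuing the asynchronous write. */
	ctx->payload += len * ctx->io_unit_size;
	ctx->offset += len;
	ctx->remaining -= len;

	spdk_blob_io_write(ctx->blob, ctx->channel, buf, off, len,
			   split_write_next, ctx);
}

/* Same arguments as spdk_blob_io_write(), plus the io unit size in bytes
 * (obtained from the blobstore when it was loaded). */
static void
blob_io_write_split(struct spdk_blob *blob, struct spdk_io_channel *channel,
		    void *payload, uint64_t offset, uint64_t length,
		    uint32_t io_unit_size,
		    spdk_blob_op_complete cb_fn, void *cb_arg)
{
	struct split_write_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx == NULL) {
		cb_fn(cb_arg, -ENOMEM);
		return;
	}

	ctx->blob = blob;
	ctx->channel = channel;
	ctx->payload = payload;
	ctx->offset = offset;
	ctx->remaining = length;
	ctx->io_unit_size = io_unit_size;
	ctx->cb_fn = cb_fn;
	ctx->cb_arg = cb_arg;

	/* Issue the first chunk through the same completion path. */
	split_write_next(ctx, 0);
}

Issuing one chunk at a time through the callback chain keeps only a single outstanding request per large write, which also avoids exhausting nvme_request objects; the bdev-layer splitting Jim suggests is still the cleaner long-term fix, since it keeps the bdev limits out of the caller entirely.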