* Re: [SPDK] Order of callback and free_request during NVMe completions
@ 2017-08-04 23:46 Harris, James R
  0 siblings, 0 replies; 3+ messages in thread
From: Harris, James R @ 2017-08-04 23:46 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6550 bytes --]

Hi Lance,

Great question.  In a lot of cases, saving off cb_fn/cb_arg would work as you suggest.  But there are a few cases where it would not – I’ll try to explain.

The io_queue_requests value represents how many nvme_request structures are allocated per IO queue pair.  In the normal case, one call to spdk_nvme_read/write (or one of its variants) consumes one nvme_request structure and results in one I/O to the SSD.  But there are a few cases where we need to split the user request into multiple I/O to the SSD – each of which will consume its own nvme_request structure (plus one for the original “parent” request).
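
To make the accounting concrete, here is a minimal sketch (illustrative only, assuming a hypothetical controller with a 128KB MDTS; requests_consumed() is a made-up helper, not an SPDK function):

#include <stdint.h>

/* How many nvme_request structures one user request consumes when the
 * driver must split it, per the parent/child scheme described above. */
uint32_t requests_consumed(uint32_t io_size_bytes, uint32_t mdts_bytes)
{
	/* Number of child I/O, rounding up. */
	uint32_t children = (io_size_bytes + mdts_bytes - 1) / mdts_bytes;

	/* An unsplit request consumes exactly one nvme_request. */
	if (children <= 1) {
		return 1;
	}

	/* One nvme_request per child, plus one for the parent. */
	return children + 1;
}

So a 1MB read against a 128KB MDTS consumes nine nvme_request structures (eight children plus the parent), not one.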

1) If the user request exceeds the controller’s MDTS (Max Data Transfer Size), SPDK will split the request into multiple I/O, each of which is MDTS or less in size.
2) Some Intel NVMe SSDs benefit from splitting I/O in host software if the I/O would otherwise span a 128KB boundary.  For example, a 64KB I/O that starts at offset 96KB (LBA 192 on a 512-byte formatted namespace) would span such a 128KB boundary.  The SSD will handle such an I/O correctly, but performance will be better if the driver splits it into two I/O – 32KB starting at LBA 192, and 32KB starting at LBA 256.  The SPDK NVMe driver does this splitting to ensure the best performance on these SSDs and to relieve the application of the burden of understanding these boundaries and doing the splits itself (see the sketch after this list).
3) Most NVMe SSDs do not support SGLs for describing the host memory buffers, but instead use a format called PRP (Physical Region Pages).  PRP has some limitations compared to full SGL.  If you’re not familiar with PRP, the NVMe specification has a good explanation.  Some usage models, such as vhost, generate I/O that are not PRP-compatible, which also requires splitting the I/O into two or more children so that each is PRP-compatible.
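
To make case 2 concrete, here is a minimal sketch of the boundary math (illustrative only, not the SPDK implementation; submit_child() is a hypothetical stand-in for building and submitting one child request):

#include <stdint.h>

#define STRIPE_BYTES	(128 * 1024)	/* 128KB boundary on these SSDs */
#define LBA_BYTES	512		/* 512-byte formatted namespace */

void submit_child(uint64_t start_lba, uint64_t lba_count);	/* hypothetical */

void split_on_stripe(uint64_t start_lba, uint64_t lba_count)
{
	uint64_t lbas_per_stripe = STRIPE_BYTES / LBA_BYTES;	/* 256 LBAs */

	while (lba_count > 0) {
		/* LBAs remaining before the next 128KB boundary. */
		uint64_t to_boundary = lbas_per_stripe - (start_lba % lbas_per_stripe);
		uint64_t this_child = lba_count < to_boundary ? lba_count : to_boundary;

		/* 64KB at LBA 192 becomes 32KB at LBA 192 + 32KB at LBA 256. */
		submit_child(start_lba, this_child);
		start_lba += this_child;
		lba_count -= this_child;
	}
}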

So, in the scenario we are discussing, if one request completes but the new I/O (submitted from the user’s callback) requires more than one nvme_request due to one of these conditions, the spdk_nvme_read/write call would still find an insufficient number of free nvme_request structures and fail.

Your suggestion is still a good one, especially for applications that know they will not submit I/O that requires splitting, and that want to depend on an ENOMEM return to stop submitting new I/O and wait for a completion before submitting a new one in the context of the completion handler.  Would you be willing to submit your patch to GerritHub so the SPDK community can review it and merge it upstream?
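
As a sketch of that usage model (illustrative only; device probing and the g_ns/g_qpair/g_buf setup are elided, and the free-before-callback ordering is assumed):

#include "spdk/nvme.h"

static struct spdk_nvme_ns *g_ns;
static struct spdk_nvme_qpair *g_qpair;
static void *g_buf;		/* pinned payload buffer, allocation elided */
static uint64_t g_next_lba;

static void
io_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	/* With free-before-callback ordering, the completed request’s
	 * nvme_request is available again, so one new unsplit I/O fits. */
	spdk_nvme_ns_cmd_read(g_ns, g_qpair, g_buf, g_next_lba++, 8,
			      io_complete, NULL, 0);
}

static void
submit_until_full(void)
{
	int rc;

	/* Queue 4KB reads until the driver runs out of nvme_requests. */
	do {
		rc = spdk_nvme_ns_cmd_read(g_ns, g_qpair, g_buf, g_next_lba++, 8,
					   io_complete, NULL, 0);
	} while (rc == 0);

	/* rc is -ENOMEM here; from now on, new I/O is driven from
	 * io_complete() as completions are reaped via
	 * spdk_nvme_qpair_process_completions(). */
}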

Thanks,

-Jim

P.S. - There is also the matter of nvme_tracker structures.  An nvme_tracker structure is tied to an nvme_request that has been submitted to the SSD.  nvme_tracker structures have memory allocated for a worst-case PRP list (4KB), and one is allocated per slot in the NVMe qpair (io_queue_size).  Note that this is a lot more memory than the size of an nvme_request (248 bytes).  If an nvme_request is submitted to the qpair and no nvme_tracker is available, that nvme_request is queued and submitted later, once an nvme_tracker becomes available.  So freeing the nvme_request earlier would usually allow the application to submit a new I/O from the context of the completion callback, but there would be some inefficiency if the nvme_tracker were not also freed earlier, and freeing the tracker earlier would add more complication to the I/O path.  More investigation is needed to quantify that inefficiency and determine whether the extra complication is warranted.
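
For a rough sense of the sizes involved (a fragment with illustrative values only; the real values come from the controller opts):

	/* 256 trackers * 4KB worst-case PRP list = 1MB per qpair */
	unsigned long tracker_bytes = 256UL * 4096;
	/* 512 nvme_requests * 248 bytes = ~124KB per qpair */
	unsigned long request_bytes = 512UL * 248;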



On 8/4/17, 8:58 AM, "SPDK on behalf of Lance Hartmann ORACLE" <spdk-bounces(a)lists.01.org on behalf of lance.hartmann(a)oracle.com> wrote:

    
    Hello,
    
    I just joined the SPDK email list after dabbling with the SPDK for a few weeks.  I looked through the SPDK email list’s archives and didn’t see any subjects focused on the topic of order of operations in NVMe completions, hence my inaugural posting to this list begins with a question about this.  I did see and read through the thread in July 2016 (Re:  Callback passed to spdk_nvme_ns_cmd_read not being called sometimes) but it didn’t touch on the specific question I have.  Please accept my apologies in advance if I missed something else related to NVMe completions when scanning through the SPDK email archives.
    
    In the SPDK NVMe driver I noted that the order of operations in processing completions is as follows:
    
    	1.  If a callback was provided, issue the callback.
    	2.  Invoke nvme_free_request().
    
    As the caller’s callback info was saved in the nvme_request struct allocated during the I/O request, it appears logical that one would reference those respective callback fields and invoke the callback, before freeing the request struct that contains them.  However, I’m curious to know if there’s any risk or possible concern for diverging from the programming model if the following sequence is done instead:
    
    	1.  Save off the req->cb_fn and req->cb_arg in new temporary pointers (local variables on the stack of the completion function).
    	2.  Invoke nvme_free_request() for the req.
    	3.  If a callback was provided (as saved in the new temporary pointer), issue the callback via the temporary saved function pointer and with the temporary ptr to the callback arg.
    
    I’m inquiring about this alteration for the scenario where, in the I/O completion callback code path, the application may (and likely will) wish to issue a new I/O.  However, if this is the first completion invoked after the maximum number of requests (based on opts->io_queue_requests) have already been submitted, then the I/O callback’s attempt to issue a new I/O will fail, because no free requests are available under the current implementation of the completion code.  If, on the other hand, the completion code is altered to invoke nvme_free_request() before calling the application’s I/O callback, a request entry will be available to the application, as it was just freed, thus enabling it to issue a new I/O successfully.
    
    I made such a change in my sandbox and this appeared to operate, but as I’m new to the SPDK NVMe driver I’m eager to solicit feedback from the seasoned developers here to learn if I’m overlooking something.
    
    regards,
    
    
    
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    



* Re: [SPDK] Order of callback and free_request during NVMe completions
@ 2017-08-05 16:14 Lance Hartmann ORACLE
  0 siblings, 0 replies; 3+ messages in thread
From: Lance Hartmann ORACLE @ 2017-08-05 16:14 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7312 bytes --]

Hi Jim,

Many thanks for the detailed explanation.  I will experiment with such a patch, including passing in a flag so that the driver is aware of the intent never to split, and then attempt to evaluate for myself, at least initially, whether the overhead and extra code are worth it for this case.  I’ll also dig into understanding the basis for the trackers.  While I did of course see them when going through the code, I had hoped that maybe they only served some ancillary purpose involving something like performance or debugging.  As your reply makes clear, that was a bad theory on my part, *grin*, but the context you provided on their relationship with PRP lists will be very helpful toward understanding their use.  Thank you again.
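
In code, the idea might look something like this (entirely hypothetical; neither the flag name nor its bit value exists in SPDK, and the surrounding variables are assumed):

	/* Hypothetical: promise that this I/O never needs splitting, so the
	 * driver could free the nvme_request before invoking the callback. */
	#define MY_NVME_IO_FLAGS_NO_SPLIT	(1u << 0)	/* not a real SPDK flag */

	rc = spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, 8,
				   io_complete, NULL, MY_NVME_IO_FLAGS_NO_SPLIT);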

cheers,

> On Aug 4, 2017, at 6:46 PM, Harris, James R <james.r.harris(a)intel.com> wrote:
> [...]


* [SPDK] Order of callback and free_request during NVMe completions
@ 2017-08-04 15:58 Lance Hartmann ORACLE
  0 siblings, 0 replies; 3+ messages in thread
From: Lance Hartmann ORACLE @ 2017-08-04 15:58 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]


Hello,

I just joined the SPDK email list after dabbling with the SPDK for a few weeks.  I looked through the SPDK email list’s archives and didn’t see any subjects focused on the topic of order of operations in NVMe completions, hence my inaugural posting to this list begins with a question about this.  I did see and read through the thread in July 2016 (Re:  Callback passed to spdk_nvme_ns_cmd_read not being called sometimes) but it didn’t touch on the specific question I have.  Please accept my apologies in advance if I missed something else related to NVMe completions when scanning through the SPDK email archives.

In the SPDK NVMe driver I noted that the order of operations in processing completions is as follows:

	1.  If a callback was provided, issue the callback.
	2.  Invoke nvme_free_request().
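
In simplified form (not the verbatim SPDK source; req and cpl are the request and completion entry in scope in the completion routine):

	if (req->cb_fn) {
		req->cb_fn(req->cb_arg, &cpl);
	}
	nvme_free_request(req);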

As the caller’s callback info was saved in the nvme_request struct allocated during the I/O request, it appears logical that one would reference those respective callback fields and invoke the callback, before freeing the request struct that contains them.  However, I’m curious to know if there’s any risk or possible concern for diverging from the programming model if the following sequence is done instead:

	1.  Save off the req->cb_fn and req->cb_arg in new temporary pointers (local variables on the stack of the completion function).
	2.  Invoke nvme_free_request() for the req.
	3.  If a callback was provided (as saved in the new temporary pointer), issue the callback via the temporary saved function pointer and with the temporary ptr to the callback arg.
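
In the same simplified form, the proposed sequence would be:

	spdk_nvme_cmd_cb cb_fn = req->cb_fn;	/* step 1: save locally */
	void *cb_arg = req->cb_arg;

	nvme_free_request(req);			/* step 2: free first */

	if (cb_fn) {
		cb_fn(cb_arg, &cpl);		/* step 3: then invoke */
	}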

I’m inquiring about this alteration for the scenario where, in the I/O completion callback code path, the application may (and likely will) wish to issue a new I/O.  However, if this is the first completion invoked after the maximum number of requests (based on opts->io_queue_requests) have already been submitted, then the I/O callback’s attempt to issue a new I/O will fail, because no free requests are available under the current implementation of the completion code.  If, on the other hand, the completion code is altered to invoke nvme_free_request() before calling the application’s I/O callback, a request entry will be available to the application, as it was just freed, thus enabling it to issue a new I/O successfully.

I made such a change in my sandbox and this appeared to operate, but as I’m new to the SPDK NVMe driver I’m eager to solicit feedback from the seasoned developers here to learn if I’m overlooking something.

regards,






end of thread, other threads:[~2017-08-05 16:14 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-04 23:46 [SPDK] Order of callback and free_request during NVMe completions Harris, James R
  -- strict thread matches above, loose matches on Subject: below --
2017-08-05 16:14 Lance Hartmann ORACLE
2017-08-04 15:58 Lance Hartmann ORACLE
