On Mon, Jul 11, 2022 at 10:56:56PM +0300, Sagi Grimberg wrote: > >>>>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct >>>>nvme_ctrl *ctrl, struct nvme_ns *ns, >>>>     pdu->meta_buffer = nvme_to_user_ptr(d.metadata); >>>>     pdu->meta_len = d.metadata_len; >>>>+    if (issue_flags & IO_URING_F_MPATH) { >>>>+        req->cmd_flags |= REQ_NVME_MPATH; >>>>+        /* >>>>+         * we would need the buffer address (d.addr field) if we have >>>>+         * to retry the command. Store it by repurposing ioucmd->cmd >>>>+         */ >>>>+        ioucmd->cmd = (void *)d.addr; >>> >>>What does repurposing mean here? >> >>This field (ioucmd->cmd) was pointing to passthrough command (which >>is embedded in SQE of io_uring). At this point we have consumed >>passthrough command, so this field can be reused if we have to. And we >>have to beceause failover needs recreating passthrough command. >>Please see nvme_uring_cmd_io_retry to see how this helps in recreating >>the fields of passthrough command. And more on this below. > >so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as >an address of buffer for a later processing that may or may not >happen in its lifetime? Yes. See this as a no-cost way to handle fast/common path. We manage by doing just this assignment. Otherwise this would have involved doing high-cost (allocate/copy/deferral) stuff for later processing that may or may not happen. >Sounds a bit of a backwards design to me. > >>>>+    } >>>>     blk_execute_rq_nowait(req, false); >>>>     return -EIOCBQUEUED; >>>> } >>>>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct >>>>io_uring_cmd *ioucmd, >>>>     int srcu_idx = srcu_read_lock(&head->srcu); >>>>     struct nvme_ns *ns = nvme_find_path(head); >>>>     int ret = -EINVAL; >>>>+    struct device *dev = &head->cdev_device; >>>>+ >>>>+    if (likely(ns)) { >>>>+        ret = nvme_ns_uring_cmd(ns, ioucmd, >>>>+                issue_flags | IO_URING_F_MPATH); >>>>+    } else if (nvme_available_path(head)) { >>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>+        struct nvme_uring_cmd *payload = NULL; >>>>+ >>>>+        dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>>>+        /* >>>>+         * We may requeue two kinds of uring commands: >>>>+         * 1. command failed post submission. pdu->req will be non-null >>>>+         * for this >>>>+         * 2. command that could not be submitted because path was not >>>>+         * available. For this pdu->req is set to NULL. >>>>+         */ >>>>+        pdu->req = NULL; >>> >>>Relying on a pointer does not sound like a good idea to me. >>>But why do you care at all where did this command came from? >>>This code should not concern itself what happened prior to this >>>execution. >>Required, please see nvme_uring_cmd_io_retry. And this should be more >>clear as part of responses to your other questions. > >I think I understand. But it looks fragile to me. > >> >>>>+        /* >>>>+         * passthrough command will not be available during retry as it >>>>+         * is embedded in io_uring's SQE. Hence we allocate/copy here >>>>+         */ >>> >>>OK, that is a nice solution. >>Please note that prefered way is to recreate the passthrough command, >>and not to allocate it. We allocate it here because this happens very early >>(i.e. before processing passthrough command and setting that up inside >>struct request). Recreating requires a populated 'struct request' . > >I think that once a driver consumes a command as queued, it needs a >stable copy of the command (could be in a percpu pool). > >The current design of nvme multipathing is that the request is stripped >from its bios and placed on a requeue_list, and if a request was not >formed yet (i.e. nvme available path exist but no usable) the bio is >placed on the requeue_list. > >So ideally we have something sufficient like a bio, that can be linked >on a requeue list, and is sufficient to build a request, at any point in >time... we could be dealing with any passthrough command here. bio is not guranteed to exist in first place. It can very well be NULL. As I mentioned in cover letter, this was tested for such passthrough commands too. And bio is not a good fit for this interface. For block-path, entry function is nvme_ns_head_submit_bio() which speaks bio. Request is allocated afterwards. But since request formation can happen only after discovering a valid path, it may have to queue the bio if path does not exist. For passthrough, interface speaks/understands struct io_uring_cmd. Request is allocated after. And bio may (or may not) be attached with this request. It may have to queue the command even before request/bio gets formed. So cardinal structure (which is really available all the time) in this case is struct io_uring_cmd. Hence we use that as the object around which requeuing/failover is built. Conceptually io_uring_cmd is the bio of this interface. >>>>+        payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); >>>>+        if (!payload) { >>>>+            ret = -ENOMEM; >>>>+            goto out_unlock; >>>>+        } >>>>+        memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); >>>>+        ioucmd->cmd = payload; >>>>-    if (ns) >>>>-        ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>+        spin_lock_irq(&head->requeue_ioucmd_lock); >>>>+        ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>>>+        spin_unlock_irq(&head->requeue_ioucmd_lock); >>>>+        ret = -EIOCBQUEUED; >>>>+    } else { >>>>+        dev_warn_ratelimited(dev, "no available path - failing I/O\n"); >>> >>>ret=-EIO ? >>Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead. > >It is not an invalid argument error here, it is essentially an IO error. >So yes, that would be my preference. sure, will change. >>>>+    } >>>>+out_unlock: >>>>     srcu_read_unlock(&head->srcu, srcu_idx); >>>>     return ret; >>>> } >>>>+ >>>>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, >>>>+        struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) >>>>+{ >>>>+    struct nvme_ctrl *ctrl = ns->ctrl; >>>>+    struct request_queue *q = ns ? ns->queue : ctrl->admin_q; >>>>+    struct nvme_uring_data d; >>>>+    struct nvme_command c; >>>>+    struct request *req; >>>>+    struct bio *obio = oreq->bio; >>>>+    void *meta = NULL; >>>>+ >>>>+    memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); >>>>+    d.metadata = (__u64)pdu->meta_buffer; >>>>+    d.metadata_len = pdu->meta_len; >>>>+    d.timeout_ms = oreq->timeout; >>>>+    d.addr = (__u64)ioucmd->cmd; >>>>+    if (obio) { >>>>+        d.data_len = obio->bi_iter.bi_size; >>>>+        blk_rq_unmap_user(obio); >>>>+    } else { >>>>+        d.data_len = 0; >>>>+    } >>>>+    blk_mq_free_request(oreq); >>> >>>The way I would do this that in nvme_ioucmd_failover_req (or in the >>>retry driven from command retriable failure) I would do the above, >>>requeue it and kick the requeue work, to go over the requeue_list and >>>just execute them again. Not sure why you even need an explicit retry >>>code. >>During retry we need passthrough command. But passthrough command is not >>stable (i.e. valid only during first submission). We can make it stable >>either by: >>(a) allocating in nvme (b) return -EAGAIN to io_uring, and it will >>do allocate + deferral >>Both add a cost. And since any command can potentially fail, that >>means taking that cost for every IO that we issue on mpath node. Even if >>no failure (initial or subsquent after IO) occcured. > >As mentioned, I think that if a driver consumes a command as queued, >it needs a stable copy for a later reformation of the request for >failover purposes. So what do you propose to make that stable? As I mentioned earlier, stable copy requires allocating/copying in fast path. And for a condition (failover) that may not even occur. I really think currrent solution is much better as it does not try to make it stable. Rather it assembles pieces of passthrough command if retry (which is rare) happens. >>So to avoid commmon-path cost, we go about doing nothing (no allocation, >>no deferral) in the outset and choose to recreate the passthrough >>command if failure occured. Hope this explains the purpose of >>nvme_uring_cmd_io_retry? > >I think it does, but I'm still having a hard time with it... > Maybe I am reiterating but few main differences that should help - - io_uring_cmd is at the forefront, and bio/request as secondary objects. Former is persistent object across requeue attempts and the only thing available when we discover the path, while other two are created every time we retry. - Receiving bio from upper layer is a luxury that we do not have for passthrough. When we receive bio, pages are already mapped and we do not have to deal with user-specific fields, so there is more freedom in using arbitrary context (workers etc.). But passthrough command continues to point to user-space fields/buffers, so we need that task context. >> >>>>+    req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>>>+            d.data_len, nvme_to_user_ptr(d.metadata), >>>>+            d.metadata_len, 0, &meta, d.timeout_ms ? >>>>+            msecs_to_jiffies(d.timeout_ms) : 0, >>>>+            ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>>>+    if (IS_ERR(req)) >>>>+        return PTR_ERR(req); >>>>+ >>>>+    req->end_io = nvme_uring_cmd_end_io; >>>>+    req->end_io_data = ioucmd; >>>>+    pdu->bio = req->bio; >>>>+    pdu->meta = meta; >>>>+    req->cmd_flags |= REQ_NVME_MPATH; >>>>+    blk_execute_rq_nowait(req, false); >>>>+    return -EIOCBQUEUED; >>>>+} >>>>+ >>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>>>+{ >>>>+    struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>>>+    struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, >>>>+            cdev); >>>>+    int srcu_idx = srcu_read_lock(&head->srcu); >>>>+    struct nvme_ns *ns = nvme_find_path(head); >>>>+    unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>>>+        IO_URING_F_MPATH; >>>>+    struct device *dev = &head->cdev_device; >>>>+ >>>>+    if (likely(ns)) { >>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>+        struct request *oreq = pdu->req; >>>>+        int ret; >>>>+ >>>>+        if (oreq == NULL) { >>>>+            /* >>>>+             * this was not submitted (to device) earlier. For this >>>>+             * ioucmd->cmd points to persistent memory. Free that >>>>+             * up post submission >>>>+             */ >>>>+            const void *cmd = ioucmd->cmd; >>>>+ >>>>+            ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>+            kfree(cmd); >>>>+        } else { >>>>+            /* >>>>+             * this was submitted (to device) earlier. Use old >>>>+             * request, bio (if it exists) and nvme-pdu to recreate >>>>+             * the command for the discovered path >>>>+             */ >>>>+            ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); >>> >>>Why is this needed? Why is reuse important here? Why not always call >>>nvme_ns_uring_cmd? >> >>Please see the previous explanation. >>If condition is for the case when we made the passthrough command stable >>by allocating beforehand. >>Else is for the case when we avoided taking that cost. > >The current design of the multipath failover code is clean: >1. extract bio(s) from request >2. link in requeue_list >3. schedule requeue_work that, > 3.1 takes bios 1-by-1, and submits them again (exactly the same way) It is really clean, and fits really well with bio based entry interface. But as I said earlier, it does not go well with uring-cmd based entry interface, and bunch of of other things which needs to be done differently for generic passthrough command. >I'd like us to try to follow the same design where retry is >literally "do the exact same thing, again". That would eliminate >two submission paths that do the same thing, but slightly different. Exact same thing is possible if we make the common path slow i.e. allocate/copy passthrough command and keep it alive until completion. But that is really not the way to go I suppose.