On Wed, Mar 08, 2023 at 06:19:41PM +0200, Max Gurtovoy wrote:
>
>
> On 08/03/2023 16:13, Stefan Hajnoczi wrote:
> > On Wed, Mar 08, 2023 at 01:17:33PM +0200, Max Gurtovoy wrote:
> > >
> > >
> > > On 06/03/2023 18:25, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 06, 2023 at 05:28:03PM +0200, Max Gurtovoy wrote:
> > > > >
> > > > >
> > > > > On 06/03/2023 13:20, Stefan Hajnoczi wrote:
> > > > > > On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:
> > > > > > >
> > > > > > > On 2023/3/6 08:03, Stefan Hajnoczi wrote:
> > > > > > > > On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > > What happens if a command takes 1 second to complete, is the device
> > > > > > > > > > allowed to process the next command from the virtqueue during this time,
> > > > > > > > > > possibly completing it before the first command?
> > > > > > > > > >
> > > > > > > > > > This requires additional clarification in the spec because "they are
> > > > > > > > > > processed by the device in the order in which they are queued" does not
> > > > > > > > > > explain whether commands block the virtqueue (in order completion) or
> > > > > > > > > > not (out of order completion).
> > > > > > > > > Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
> > > > > > > > virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
> > > > > > > > Several may be processed by the device at the same time.
> > > > > > > >
> > > > > > > > They rely on multi-queue for abort operations:
> > > > > > > >
> > > > > > > > In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
> > > > > > > > sent on the control virtqueue. The request identifier namespace is
> > > > > > > > shared across all virtqueues so it's possible to abort a request that
> > > > > > > > was submitted to any command virtqueue.
> > > > > > > >
> > > > > > > > NVMe also follows the same design where abort commands are sent on the
> > > > > > > > Admin Submission Queue instead of an I/O Submission Queue. It's possible
> > > > > > > > to identify NVMe requests by .
> > > > > > > >
> > > > > > > > virtio-blk doesn't support aborting requests.
> > > > > > > >
> > > > > > > > I think the logic behind this design is that if a queue gets stuck
> > > > > > > > processing long-running requests, then the device should not be forced
> > > > > > > > to perform lookahead in the queue to find abort commands. A separate
> > > > > > > > control/admin queue is used for the abort requests.
> > > > > > >
> > > > > > >
> > > > > > > Or device need mandate some kind of QOS here, e.g a request must be complete
> > > > > > > in some time. Otherwise we don't have sufficient reliability for using it as
> > > > > > > management task?
> > > > > >
> > > > > > Yes, if all commands can be executed in bounded time then a guarantee is
> > > > > > possible.
> > > > > >
> > > > > > Here is an example where that's hard: imagine a virtio-blk device backed
> > > > > > by network storage. When an admin queue command is used to delete a
> > > > > > group member, any of the group member's in-flight I/O requests need to
> > > > > > be aborted. If the network hangs while the group member is being
> > > > > > deleted, then the device can't complete an orderly shutdown of I/O
> > > > > > requests in a reasonable time.
> > > > > >
> > > > > > That example shows a basic group admin command that I think Michael is
> > > > > > about to propose. We can't avoid this problem by not making it a group
> > > > > > admin command - it needs to be a group admin command. So I think it's
> > > > > > likely that there will be admin commands that take an unbounded amount
> > > > > > of time to complete. One way to achieve what you mentioned is timeouts.
> > > > >
> > > > > I think that you're getting into device specific implementation details and
> > > > > I'm not sure it's necessary.
> > > > >
> > > > > I don't think we need to abort admin commands. Admin commands can be
> > > > > flushed/aborted during the device reset phase.
> > > > > Only IO commands should have the possibility to being aborted as you
> > > > > mentioned in NVMe and SCSI (and potentially in virtio-blk).
> > > >
> > > > It's a general design issue that should be clarified now rather than
> > > > being left unspecified.
> > > >
> > > > I'm not saying that it must be possible to abort admin commands. There
> > > > are other options, like requiring the device itself to fail a command
> > > > after a timeout.
> > >
> > > do you have an example of timeout today for control vq ?
> >
> > Do you mean the virtio-net control virtqueue? I don't think it has any
> > commands with an unbounded execution time.
>
> Correct. So why introducing it now ?

The examples I've given are the create and delete group member
operations. I think those operations will take unbounded time in some
device implementations.

>
> >
> > >
> > > >
> > > > Or we could say that admin commands must complete within bounded time,
> > > > but I'm not sure that is implementable for some device types like
> > > > virtio-blk, virtio-scsi, and virtiofs.
> > >
> > > No we can't.
> > > Some commands, for example FW upgrade can take 10 minutes and it's perfectly
> > > fine. Other commands like setting feature bit will take 1 millisec.
> > > Each device implements commands in a different internal logic so we can't
> > > expect to complete after X time.
> >
> > When I say bounded time, I mean that it finishes in a finite amount of
> > time. I'm not saying there is a specific time X that all device
> > implementations must satisfy. Unbounded means it might never finish.
>
> There might be a chance that any command for any virtio device type will
> never finish. Nothing new here in the adminq.
>
> what one can do is to set a timeout for himself and if this timeout expire -
> check the device status. If it needs_reset - do a reset. if status is ok,
> then wait some more time.
> After X retries, unmap buffers or reset the adminq.

Michael: What effect does resetting the group owner device have on group
member devices?

I'm concerned that this approach disrupts all group member devices. For
example, you try to add a new device but the command hangs. In order to
recover you now have to reset the group owner device and this breaks all
the group member devices.

>
> >
> > > Device can go to some FATAL state in case a command is stuck and causing
> > > internal errors in it.
> > >
> > > >
> > > > > For your example, stopping a member is possible even if there are some
> > > > > errors in the network. You can for example destroy all the connections to
> > > > > the remote target and complete all the BIOS with some error.
> > > >
> > > > Forgetting about in-flight requests doesn't necessarily make them go
> > > > away. It creates a race between forgotten requests and reconnection. In
> > > > the worst case a forgotten write request takes effect after
> > > > reconnection, causing data corruption.
> > >
> > > For making it work without data corruption we need a cooperation of the
> > > target side for sure. But this is fine since the target in that case is part
> > > of the "virtio-blk backend".
> > > One solution is that the target can decide it will flush all the requests to
> > > the storage device before accepting new connections.
> >
> > This solution shifts the unbounded time from disconnection to
> > connection. The Group Member Delete command will complete quickly but a
> > subsequent Group Member Create command for the same underlying storage
> > device would need to wait until the requests are done.
> >
> > Therefore I think the admin queue must be designed under the assumption
> > that some commands take a very long time.
>
> For sure an admin command may take long time. FW upgrade can take 10 minutes
> for example.
> But each device is free to implement internal logic as he choose.
>
> Same for live migration, when we stop/quiesce a device we must make sure it
> doesn't master any DMA operations. Thus, in some implementations we need to
> wait for all inflights to end fast. In others, we can invalidate the access
> to host/guest memory and wait for completions until the freeze state.
>
> Bottom line, this is device implementation specific consideration.

What I'm asking is that the spec clarifies the command completion order
semantics (in-order or out-of-order), whether there is a mechanism to
abort commands, etc. Device implementers can then take advantage of
those aspects to implement devices that don't hang (e.g. where health
monitoring becomes unavailable while a long-running command is in
progress).

If the spec doesn't cover this, then device implementers will not be
able to work around it when implementing standard commands like
create/delete group member.

Does that make sense?

Stefan
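
To make the in-order versus out-of-order distinction concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions, not anything taken from the spec or from a real device: the
Command class, the completion_times() helper, the command names, and the
durations are all hypothetical, every command is assumed to be queued at
time 0, and the out-of-order case assumes the device can work on all
queued commands at once.

```python
from dataclasses import dataclass


@dataclass
class Command:
    name: str
    duration: int  # hypothetical time units the device needs to execute it


def completion_times(commands, in_order):
    """Return {name: completion time} for commands all queued at time 0.

    in_order=True:  a command starts only after the previous one has
                    completed, so a slow command blocks the queue.
    in_order=False: the device processes all queued commands concurrently
                    (assumption: unlimited parallelism) and completes each
                    one as soon as it is done.
    """
    times = {}
    clock = 0
    for cmd in commands:
        if in_order:
            clock += cmd.duration
            times[cmd.name] = clock
        else:
            times[cmd.name] = cmd.duration
    return times


# Hypothetical admin queue: a slow group member delete (for example, one
# waiting on hung network storage) followed by a quick health query.
queue = [
    Command("delete_group_member", 600),
    Command("health_query", 1),
]

print(completion_times(queue, in_order=True))
print(completion_times(queue, in_order=False))
```

With in-order completion the health query cannot complete until the slow
delete has finished, while with out-of-order completion it completes
after its own single time unit. That is the difference a driver or
device implementer would need the spec to state explicitly.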