On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi wrote:
> >
> > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > From: Xie Yongji
> > >
> > > This patchset is aimed at supporting qemu to reconnect
> > > vhost-user-blk backend after vhost-user-blk backend crash or
> > > restart.
> > >
> > > The patch 1 uses existing wait/nowait options to make QEMU not
> > > do a connect on client sockets during initialization of the chardev.
> > >
> > > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > > memory to backend.
> >
> > Can you describe the problem that the inflight I/O shared memory region
> > solves?
> >
>
> The backend needs to get inflight I/O and do I/O replay after restart.
> Now we can only get used_idx in the used ring. That is not enough,
> because we might not always process descriptors in the same order in
> which they were made available. A simple example:
> https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> shared memory region to track inflight I/O.

The inflight shared memory region is something that needs to be
discussed in detail to make sure it is correct not just for vhost-blk
but also for other vhost-user devices in the future.  Please expand the
protocol documentation so that someone can implement this feature
without looking at the code.  Explain the reason for the inflight
shared memory region and how exactly it is used.

After a quick look at the shared memory region documentation I have a
few questions:

1. The avail ring only contains "head" indices, each of which may chain
   non-head descriptors.  Does the inflight memory region track only
   the heads?

2. Does the inflight shared memory region preserve avail ring order?
   For example, if the guest submitted 5 2 4, will the new vhost-user
   backend keep that order or does it see 2 4 5?

3. What are the exact memory region size requirements?  Imagine a
   device with a large virtqueue.  Or a device with multiple virtqueues
   of different sizes.  Can your structure handle those cases?

I'm concerned that this approach to device recovery is invasive and
hard to test.  Instead I would use VIRTIO's Device Status Field
DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
necessary.  This is more disruptive - drivers either have to resubmit
or fail I/O with EIO - but it's also simple and more likely to work
correctly (it only needs to be implemented correctly in the guest
driver, not in the many available vhost-user backend implementations).

Stefan
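
To make questions 1-3 above concrete, here is a minimal sketch of what a
per-virtqueue inflight region could look like.  The structure name, the
fields, and the one-byte-per-descriptor-head idea are assumptions for
discussion only; they are not taken from the patchset or from the
vhost-user specification.

    /* Hypothetical layout, one region per virtqueue; names are illustrative. */
    #include <stdint.h>

    typedef struct InflightRegion {
        uint16_t version;      /* layout version, for future extension */
        uint16_t vq_size;      /* number of entries that follow */
        /* One byte per descriptor head, indexed by head index:
         * 0 = not inflight, 1 = fetched from the avail ring but not yet
         * pushed to the used ring.  Chained (non-head) descriptors are
         * not tracked individually; the head is enough to replay the
         * request. */
        uint8_t inflight[];    /* vq_size entries */
    } InflightRegion;

    /* Backend lifecycle (sketch):
     *   fetch request:    region->inflight[head] = 1;
     *   complete request: region->inflight[head] = 0;  then update used ring
     * After a reconnect, the new backend scans inflight[] and resubmits
     * every head still marked 1.  Note that this particular layout does
     * not preserve avail-ring order (question 2) and needs the header
     * plus vq_size bytes per virtqueue (question 3). */

A real layout would also have to say how several virtqueues of
different sizes share one file descriptor, which is part of what the
protocol documentation should spell out.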
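For comparison, a rough guest-driver-side sketch of the
DEVICE_NEEDS_RESET path mentioned above.  VIRTIO_CONFIG_S_NEEDS_RESET
(0x40) is the status bit defined by the VIRTIO specification; the
helper functions are hypothetical stubs, not real driver APIs.

    /* Sketch of guest-side recovery via DEVICE_NEEDS_RESET. */
    #include <stdint.h>
    #include <stdio.h>

    #define VIRTIO_CONFIG_S_NEEDS_RESET 0x40   /* from the VIRTIO spec */

    /* Hypothetical stubs standing in for real driver plumbing. */
    static uint8_t read_status(void)         { return VIRTIO_CONFIG_S_NEEDS_RESET; }
    static void reset_and_reinit(void)       { puts("reset + reinit device"); }
    static void requeue_or_fail_pending(void){ puts("resubmit or fail with EIO"); }

    static void handle_config_change(void)
    {
        if (read_status() & VIRTIO_CONFIG_S_NEEDS_RESET) {
            /* The device (e.g. behind a restarted vhost-user backend)
             * cannot make progress; recover by resetting the device and
             * then resubmitting or failing the outstanding requests. */
            reset_and_reinit();
            requeue_or_fail_pending();
        }
    }

    int main(void)
    {
        handle_config_change();
        return 0;
    }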