linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: "Xie Yongji" <xieyongji@bytedance.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Parav Pandit" <parav@nvidia.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Christian Brauner" <christian.brauner@canonical.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Jens Axboe" <axboe@kernel.dk>, "bcrl@kvack.org" <bcrl@kvack.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Mika Penttilä" <mika.penttila@nextfour.com>,
	"Dan Carpenter" <dan.carpenter@oracle.com>,
	joro@8bytes.org, gregkh@linuxfoundation.org,
	"songmuchun@bytedance.com" <songmuchun@bytedance.com>,
	virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
Date: Wed, 7 Jul 2021 16:54:26 +0100	[thread overview]
Message-ID: <YOXOMiPl7mKd7FoM@stefanha-x1.localdomain> (raw)
In-Reply-To: <a03c8627-7dac-2255-a2d9-603fc623b618@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 8501 bytes --]

On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that config space is
> > > > > > > > > not suitable for this kind of error propagating. And it would be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > 
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > 
> > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > config space accesses.
> > > > > > > > 
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > > 
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail.
> > > > > > > > > That is to say,
> > > > > > > > > 
> > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > always succeed
> > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > 
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > above requires caching.
> > > > > > > > 
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > configuration space?
> > > > > > > Because:
> > > > > > > 
> > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > with vhost-user.
> > > > > > virtio_cread() can sleep:
> > > > > > 
> > > > > >     #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > >             do {                                                            \
> > > > > >                     typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > >                                                                             \
> > > > > >                     might_sleep();                                          \
> > > > > >                     ^^^^^^^^^^^^^^
> > > > > > 
> > > > > > Which code path cannot sleep?
> > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > 
> > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > 
> > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > with the kernel device via ioctls
> > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > 
> > > > > This seems more robust and safe than simply relaying everything to
> > > > > userspace and waiting for its response.
> > > > > 
> > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > sendmsg()/recvmsg().
> > > > > 
> > > > > Maintaining the config in the kernel follows this model and it can
> > > > > simplify the device generation implementation.
> > > > > 
> > > > > For config space write, it requires more thought but fortunately it's
> > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > device/features that depends on the config write.
> > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > guess we'll face the same challenge.
> > > 
> > > Probably not, userspace device can tell the kernel about the device_features
> > > and mandated_features during creation, and the feature negotiation could be
> > > done purely in the kernel without bothering the userspace.
> 
> 
> (For some reason I drop the list accidentally, adding them back, sorry)
> 
> 
> > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > interface where the driver waits for the device.
> 
> 
> It depends on how we define "synchronous" here. If I understand correctly,
> the spec doesn't expect there will be any kind of failure for the operation
> of set_status itself.
> 
> Instead, anytime it want any synchronization, it should be done via
> get_status():
> 
> 1) re-read device status to make sure FEATURES_OK is set during feature
> negotiation
> 2) re-read device status to be 0 to make sure the device has finish the
> reset
> 
> 
> > 
> > VDUSE currently doesn't wait for the device emulation process to handle
> > this message (no reply is needed) but I think this is a mistake because
> > VDUSE is not following the VIRTIO device model.
> 
> 
> With the trick that is done for FEATURES_OK above, I think we don't need to
> wait for the reply.
> 
> If userspace takes too long to respond, it can be detected since
> get_status() doesn't return the expected value for long time.
> 
> And for the case that needs a timeout, we probably can use NEEDS_RESET.

I think you're right. get_status is the synchronization point, not
set_status.

Currently there is no VDUSE GET_STATUS message. The
VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
that the device emulation program can participate in emulating the
Device Status field. This change could affect VDUSE's VIRTIO feature
interface since the device emulation program can reject features by not
setting FEATURES_OK.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2021-07-07 15:54 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-15 14:13 [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-06-15 14:13 ` [PATCH v8 01/10] iova: Export alloc_iova_fast() and free_iova_fast(); Xie Yongji
2021-06-15 14:13 ` [PATCH v8 02/10] file: Export receive_fd() to modules Xie Yongji
2021-06-15 14:13 ` [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
2021-06-17  8:33   ` He Zhe
2021-06-18  3:29     ` Yongji Xie
2021-06-18  8:41       ` He Zhe
2021-06-18  8:44       ` [PATCH] eventfd: Enlarge recursion limit to allow vhost to work He Zhe
2021-07-03  8:31         ` Michael S. Tsirkin
2021-08-25  7:57         ` Yongji Xie
2021-06-15 14:13 ` [PATCH v8 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
2021-06-15 14:13 ` [PATCH v8 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
2021-06-15 14:13 ` [PATCH v8 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() Xie Yongji
2021-06-15 14:13 ` [PATCH v8 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
2021-06-15 14:13 ` [PATCH v8 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
2021-06-15 14:13 ` [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-06-21  9:13   ` Jason Wang
2021-06-21 10:41     ` Yongji Xie
2021-06-22  5:06       ` Jason Wang
2021-06-22  7:22         ` Yongji Xie
2021-06-22  7:49           ` Jason Wang
2021-06-22  8:14             ` Yongji Xie
2021-06-23  3:30               ` Jason Wang
2021-06-23  5:50                 ` Yongji Xie
2021-06-24  3:34                   ` Jason Wang
2021-06-24  4:46                     ` Yongji Xie
2021-06-24  8:13                       ` Jason Wang
2021-06-24  9:16                         ` Yongji Xie
2021-06-25  3:08                           ` Jason Wang
2021-06-25  4:19                             ` Yongji Xie
2021-06-28  4:40                               ` Jason Wang
2021-06-29  2:26                                 ` Yongji Xie
2021-06-29  3:29                                   ` Jason Wang
2021-06-29  3:56                                     ` Yongji Xie
2021-06-29  4:03                                       ` Jason Wang
2021-06-24 14:46   ` Stefan Hajnoczi
2021-06-29  2:59     ` Yongji Xie
2021-06-30  9:51       ` Stefan Hajnoczi
2021-07-01  6:50         ` Yongji Xie
2021-07-01  7:55           ` Jason Wang
2021-07-01 10:26             ` Yongji Xie
2021-07-02  3:25               ` Jason Wang
2021-07-07  8:52   ` Stefan Hajnoczi
2021-07-07  9:19     ` Yongji Xie
2021-06-15 14:13 ` [PATCH v8 10/10] Documentation: Add documentation for VDUSE Xie Yongji
2021-06-24 13:01   ` Stefan Hajnoczi
2021-06-29  5:43     ` Yongji Xie
2021-06-30 10:06       ` Stefan Hajnoczi
2021-07-01 10:00         ` Yongji Xie
2021-07-01 13:15           ` Stefan Hajnoczi
2021-07-04  9:49             ` Yongji Xie
2021-07-05  3:36               ` Jason Wang
2021-07-05 12:49                 ` Stefan Hajnoczi
2021-07-06  2:34                   ` Jason Wang
2021-07-06 10:14                     ` Stefan Hajnoczi
     [not found]                       ` <CACGkMEs2HHbUfarum8uQ6wuXoDwLQUSXTsa-huJFiqr__4cwRg@mail.gmail.com>
     [not found]                         ` <YOSOsrQWySr0andk@stefanha-x1.localdomain>
     [not found]                           ` <100e6788-7fdf-1505-d69c-bc28a8bc7a78@redhat.com>
     [not found]                             ` <YOVr801d01YOPzLL@stefanha-x1.localdomain>
2021-07-07  9:24                               ` Jason Wang
2021-07-07 15:54                                 ` Stefan Hajnoczi [this message]
2021-07-08  4:17                                   ` Jason Wang
2021-07-08  9:06                                     ` Stefan Hajnoczi
2021-07-08 12:35                                       ` Yongji Xie
2021-07-06  3:04                   ` Yongji Xie
2021-07-06 10:22                     ` Stefan Hajnoczi
2021-07-07  9:09                       ` Yongji Xie
2021-07-08  9:07                         ` Stefan Hajnoczi
2021-06-24 15:12 ` [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Stefan Hajnoczi
2021-06-29  3:15   ` Yongji Xie
2021-06-28 10:33 ` Liu Xiaodong
2021-06-28  4:35   ` Jason Wang
2021-06-28  5:54     ` Liu, Xiaodong
2021-06-29  4:10       ` Jason Wang
2021-06-29  7:56         ` Liu, Xiaodong
2021-06-29  8:14           ` Yongji Xie
2021-06-28 10:32   ` Yongji Xie
2021-06-29  4:12     ` Jason Wang
2021-06-29  6:40       ` Yongji Xie
2021-06-29  7:33         ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YOXOMiPl7mKd7FoM@stefanha-x1.localdomain \
    --to=stefanha@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=christian.brauner@canonical.com \
    --cc=corbet@lwn.net \
    --cc=dan.carpenter@oracle.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@infradead.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jasowang@redhat.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mika.penttila@nextfour.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=parav@nvidia.com \
    --cc=rdunlap@infradead.org \
    --cc=sgarzare@redhat.com \
    --cc=songmuchun@bytedance.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=willy@infradead.org \
    --cc=xieyongji@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).