linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Yongji Xie <xieyongji@bytedance.com>
To: Stefan Hajnoczi <stefanha@redhat.com>, Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Parav Pandit" <parav@nvidia.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Christian Brauner" <christian.brauner@canonical.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Jens Axboe" <axboe@kernel.dk>,
	bcrl@kvack.org, "Jonathan Corbet" <corbet@lwn.net>,
	"Mika Penttilä" <mika.penttila@nextfour.com>,
	"Dan Carpenter" <dan.carpenter@oracle.com>,
	joro@8bytes.org, "Greg KH" <gregkh@linuxfoundation.org>,
	songmuchun@bytedance.com,
	virtualization <virtualization@lists.linux-foundation.org>,
	netdev@vger.kernel.org, kvm <kvm@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, iommu@lists.linux-foundation.org,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
Date: Sun, 4 Jul 2021 17:49:15 +0800	[thread overview]
Message-ID: <CACycT3vo-diHgTSLw_FS2E+5ia5VjihE3qw7JmZR7JT55P-wQA@mail.gmail.com> (raw)
In-Reply-To: <YN3ABqCMLQf7ejOm@stefanha-x1.localdomain>

On Thu, Jul 1, 2021 at 9:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> > On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > > > +     {
> > > > > > +             int fd;
> > > > > > +             void *addr;
> > > > > > +             size_t size;
> > > > > > +             struct vduse_iotlb_entry entry;
> > > > > > +
> > > > > > +             entry.start = iova;
> > > > > > +             entry.last = iova + 1;
> > > > >
> > > > > Why +1?
> > > > >
> > > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > > buffer for the full iova range, if necessary.
> > > > >
> > > >
> > > > The function is used to translate iova to va. And the *len is not
> > > > specified by the caller. Instead, it's used to tell the caller the
> > > > length of the contiguous iova region from the specified iova. And the
> > > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > > overlapped iova region. So using iova + 1 should be enough here.
> > >
> > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > > wonder why userspace needs to assign a value at all if it's always +1.
> > >
> >
> > If we need to get some iova regions in the specified range, we need
> > the entry.last field. For example, we can use [0, ULONG_MAX] to get
> > the first overlapped iova region which might be [4096, 8192]. But in
> > this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> > get the iova region including the specified iova.
>
> I see, thanks for explaining!
>
> > > > > > +             return addr + iova - entry.start;
> > > > > > +     }
> > > > > > +
> > > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > > >
> > > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > > works. There must be a way for userspace to report the device's
> > > > > supported feature bits to the kernel.
> > > > >
> > > >
> > > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > > device's supported feature bits when creating a new VDUSE device with
> > > > ioctl(VDUSE_CREATE_DEV).
> > >
> > > Can the VDUSE device influence feature bit negotiation? For example, if
> > > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > > does QEMU or the guest find out about this?
> > >
> >
> > There is a "features" field in struct vduse_dev_config which is used
> > to do feature negotiation.
>
> This approach is more restrictive than required by the VIRTIO
> specification:
>
>   "The device SHOULD accept any valid subset of features the driver
>   accepts, otherwise it MUST fail to set the FEATURES_OK device status
>   bit when the driver writes it."
>
>   https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002
>
> The spec allows a device to reject certain subsets of features. For
> example, if feature B depends on feature A and can only be enabled when
> feature A is also enabled.
>
> From your description I think VDUSE would accept feature B without
> feature A since the device implementation has no opportunity to fail
> negotiation with custom logic.
>

Yes, we discussed it [1] before. So I'd like to re-introduce
SET_STATUS messages so that the userspace can fail feature negotiation
during setting FEATURES_OK status bit.

[1]  https://lkml.org/lkml/2021/6/28/1587

> Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
> the device implementation full flexibility in which subsets of features
> to accept.
>
> This is a corner case. Many or maybe even all existing VIRTIO devices
> don't need this flexibility, but I want to point out this limitation in
> the VDUSE interface because it may cause issues in the future.
>
> > > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > > > >
> > > > > Does this mean the contents of the configuration space are cached by
> > > > > VDUSE?
> > > >
> > > > Yes, but the kernel will also store the same contents.
> > > >
> > > > > The downside is that the userspace code cannot generate the
> > > > > contents on demand. Most devices doin't need to generate the contents
> > > > > on demand, so I think this is okay but I had expected a different
> > > > > interface:
> > > > >
> > > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > > >
> > > >
> > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > > will need lots of modification of virtio codes to support that. So to
> > > > make it simple, we choose this way:
> > > >
> > > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > > case someone thinks it's important to support generating the contents of
> > > > > the configuration space on demand.
> > > > >
> > > >
> > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> > >
> > > If the contents of the configuration space change continuously, then the
> > > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > > conditions. For example, imagine a device where the driver can read a
> > > timer from the configuration space. I think the VIRTIO device model
> > > allows that although I'm not aware of any devices that do something like
> > > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > > called frequently to keep the timer value updated even though the guest
> > > driver probably isn't accessing it.
> > >
> >
> > OK, I get you now. Since the VIRTIO specification says "Device
> > configuration space is generally used for rarely-changing or
> > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > ioctl should not be called frequently.
>
> The spec uses MUST and other terms to define the precise requirements.
> Here the language (especially the word "generally") is weaker and means
> there may be exceptions.
>
> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> approach is reads that have side-effects. For example, imagine a field
> containing an error code if the device encounters a problem unrelated to
> a specific virtqueue request. Reading from this field resets the error
> code to 0, saving the driver an extra configuration space write access
> and possibly race conditions. It isn't possible to implement those
> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> makes me think that the interface does not allow full VIRTIO semantics.
>

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

> > > What's worse is that there might be race conditions where other
> > > driver->device operations are supposed to update the configuration space
> > > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> > > outdated copy.
> > >
> >
> > I'm not sure. Should the device and driver be able to access the same
> > fields concurrently?
>
> Yes. The VIRTIO spec has a generation count to handle multi-field
> accesses so that consistency can be ensured:
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004
>

I see.

Thanks,
Yongji

  reply	other threads:[~2021-07-04  9:49 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-15 14:13 [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-06-15 14:13 ` [PATCH v8 01/10] iova: Export alloc_iova_fast() and free_iova_fast(); Xie Yongji
2021-06-15 14:13 ` [PATCH v8 02/10] file: Export receive_fd() to modules Xie Yongji
2021-06-15 14:13 ` [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal() Xie Yongji
2021-06-17  8:33   ` He Zhe
2021-06-18  3:29     ` Yongji Xie
2021-06-18  8:41       ` He Zhe
2021-06-18  8:44       ` [PATCH] eventfd: Enlarge recursion limit to allow vhost to work He Zhe
2021-07-03  8:31         ` Michael S. Tsirkin
2021-08-25  7:57         ` Yongji Xie
2021-06-15 14:13 ` [PATCH v8 04/10] vhost-iotlb: Add an opaque pointer for vhost IOTLB Xie Yongji
2021-06-15 14:13 ` [PATCH v8 05/10] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map() Xie Yongji
2021-06-15 14:13 ` [PATCH v8 06/10] vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap() Xie Yongji
2021-06-15 14:13 ` [PATCH v8 07/10] vdpa: Support transferring virtual addressing during DMA mapping Xie Yongji
2021-06-15 14:13 ` [PATCH v8 08/10] vduse: Implement an MMU-based IOMMU driver Xie Yongji
2021-06-15 14:13 ` [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace Xie Yongji
2021-06-21  9:13   ` Jason Wang
2021-06-21 10:41     ` Yongji Xie
2021-06-22  5:06       ` Jason Wang
2021-06-22  7:22         ` Yongji Xie
2021-06-22  7:49           ` Jason Wang
2021-06-22  8:14             ` Yongji Xie
2021-06-23  3:30               ` Jason Wang
2021-06-23  5:50                 ` Yongji Xie
2021-06-24  3:34                   ` Jason Wang
2021-06-24  4:46                     ` Yongji Xie
2021-06-24  8:13                       ` Jason Wang
2021-06-24  9:16                         ` Yongji Xie
2021-06-25  3:08                           ` Jason Wang
2021-06-25  4:19                             ` Yongji Xie
2021-06-28  4:40                               ` Jason Wang
2021-06-29  2:26                                 ` Yongji Xie
2021-06-29  3:29                                   ` Jason Wang
2021-06-29  3:56                                     ` Yongji Xie
2021-06-29  4:03                                       ` Jason Wang
2021-06-24 14:46   ` Stefan Hajnoczi
2021-06-29  2:59     ` Yongji Xie
2021-06-30  9:51       ` Stefan Hajnoczi
2021-07-01  6:50         ` Yongji Xie
2021-07-01  7:55           ` Jason Wang
2021-07-01 10:26             ` Yongji Xie
2021-07-02  3:25               ` Jason Wang
2021-07-07  8:52   ` Stefan Hajnoczi
2021-07-07  9:19     ` Yongji Xie
2021-06-15 14:13 ` [PATCH v8 10/10] Documentation: Add documentation for VDUSE Xie Yongji
2021-06-24 13:01   ` Stefan Hajnoczi
2021-06-29  5:43     ` Yongji Xie
2021-06-30 10:06       ` Stefan Hajnoczi
2021-07-01 10:00         ` Yongji Xie
2021-07-01 13:15           ` Stefan Hajnoczi
2021-07-04  9:49             ` Yongji Xie [this message]
2021-07-05  3:36               ` Jason Wang
2021-07-05 12:49                 ` Stefan Hajnoczi
2021-07-06  2:34                   ` Jason Wang
2021-07-06 10:14                     ` Stefan Hajnoczi
     [not found]                       ` <CACGkMEs2HHbUfarum8uQ6wuXoDwLQUSXTsa-huJFiqr__4cwRg@mail.gmail.com>
     [not found]                         ` <YOSOsrQWySr0andk@stefanha-x1.localdomain>
     [not found]                           ` <100e6788-7fdf-1505-d69c-bc28a8bc7a78@redhat.com>
     [not found]                             ` <YOVr801d01YOPzLL@stefanha-x1.localdomain>
2021-07-07  9:24                               ` Jason Wang
2021-07-07 15:54                                 ` Stefan Hajnoczi
2021-07-08  4:17                                   ` Jason Wang
2021-07-08  9:06                                     ` Stefan Hajnoczi
2021-07-08 12:35                                       ` Yongji Xie
2021-07-06  3:04                   ` Yongji Xie
2021-07-06 10:22                     ` Stefan Hajnoczi
2021-07-07  9:09                       ` Yongji Xie
2021-07-08  9:07                         ` Stefan Hajnoczi
2021-06-24 15:12 ` [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Stefan Hajnoczi
2021-06-29  3:15   ` Yongji Xie
2021-06-28 10:33 ` Liu Xiaodong
2021-06-28  4:35   ` Jason Wang
2021-06-28  5:54     ` Liu, Xiaodong
2021-06-29  4:10       ` Jason Wang
2021-06-29  7:56         ` Liu, Xiaodong
2021-06-29  8:14           ` Yongji Xie
2021-06-28 10:32   ` Yongji Xie
2021-06-29  4:12     ` Jason Wang
2021-06-29  6:40       ` Yongji Xie
2021-06-29  7:33         ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACycT3vo-diHgTSLw_FS2E+5ia5VjihE3qw7JmZR7JT55P-wQA@mail.gmail.com \
    --to=xieyongji@bytedance.com \
    --cc=axboe@kernel.dk \
    --cc=bcrl@kvack.org \
    --cc=christian.brauner@canonical.com \
    --cc=corbet@lwn.net \
    --cc=dan.carpenter@oracle.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@infradead.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jasowang@redhat.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mika.penttila@nextfour.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=parav@nvidia.com \
    --cc=rdunlap@infradead.org \
    --cc=sgarzare@redhat.com \
    --cc=songmuchun@bytedance.com \
    --cc=stefanha@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).