From: Yongji Xie <xieyongji@bytedance.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: kvm <kvm@vger.kernel.org>, "Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Wang" <jasowang@redhat.com>,
	virtualization <virtualization@lists.linux-foundation.org>,
	"Christian Brauner" <christian.brauner@canonical.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Carpenter" <dan.carpenter@oracle.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Liu Xiaodong" <xiaodong.liu@intel.com>,
	"Joe Perches" <joe@perches.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	songmuchun@bytedance.com, "Jens Axboe" <axboe@kernel.dk>,
	"He Zhe" <zhe.he@windriver.com>,
	"Greg KH" <gregkh@linuxfoundation.org>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	iommu@lists.linux-foundation.org, bcrl@kvack.org,
	netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	"Mika Penttilä" <mika.penttila@nextfour.com>
Subject: Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()
Date: Thu, 5 Aug 2021 20:34:01 +0800	[thread overview]
Message-ID: <CACycT3vARzvd4-dkZhDHqUkeYoSxTa2ty0z0ivE1znGti+n1-g@mail.gmail.com> (raw)
In-Reply-To: <417ce5af-4deb-5319-78ce-b74fb4dd0582@arm.com>

On Wed, Aug 4, 2021 at 11:43 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2021-08-04 06:02, Yongji Xie wrote:
> > On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 2021-08-03 09:54, Yongji Xie wrote:
> >>> On Tue, Aug 3, 2021 at 3:41 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>
> >>>>
> >>>> On 2021/7/29 3:34 PM, Xie Yongji wrote:
> >>>>> Export alloc_iova_fast() and free_iova_fast() so that
> >>>>> some modules can use them to improve iova allocation efficiency.
> >>>>
> >>>>
> >>>> It's better to explain why alloc_iova() is not sufficient here.
> >>>>
> >>>
> >>> Fine.
> >>
> >> What I fail to understand from the later patches is what the IOVA domain
> >> actually represents. If the "device" is a userspace process then
> >> logically the "IOVA" would be the userspace address, so presumably
> >> somewhere you're having to translate between this arbitrary address
> >> space and actual usable addresses - if you're worried about efficiency
> >> surely it would be even better to not do that?
> >>
> >
> > Yes, the userspace daemon needs to translate the "IOVA" in a DMA
> > descriptor to the VA (from mmap(2)). But this actually doesn't affect
> > performance since it's an identity mapping in most cases.
>
> I'm not familiar with the vhost_iotlb stuff, but it looks suspiciously
> like you're walking yet another tree to make those translations. Even if
> the buffer can be mapped all at once with a fixed offset such that each
> DMA mapping call doesn't need a lookup for each individual "IOVA" - that
> might be what's happening already, but it's a bit hard to follow just
> reading the patches in my mail client - vhost_iotlb_add_range() doesn't
> look like it's super-cheap to call, and you're serialising on a lock for
> that.
>

Yes, that's true. Since the software IOTLB is not used in the VM case,
we need a unified way (vhost_iotlb) to manage the IOVA mappings for
both the VM and container cases.
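
Just to make the cost you mention concrete, the per-translation lookup
is roughly something like the sketch below (illustrative only, not the
actual VDUSE code; the example_* name is invented):

#include <linux/vhost_iotlb.h>

/*
 * Illustrative sketch: resolve an IOVA to the address recorded in a
 * vhost_iotlb. vhost_iotlb_itree_first() walks an interval tree, which
 * is the per-translation cost (plus the locking around insertion via
 * vhost_iotlb_add_range()) being discussed here.
 */
static u64 example_iova_to_addr(struct vhost_iotlb *iotlb, u64 iova)
{
	struct vhost_iotlb_map *map;

	map = vhost_iotlb_itree_first(iotlb, iova, iova);
	if (!map)
		return 0; /* no translation installed for this IOVA */

	return map->addr + (iova - map->start);
}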

> My main point, though, is that if you've already got something else
> keeping track of the actual addresses, then the way you're using an
> iova_domain appears to be something you could do with a trivial bitmap
> allocator. That's why I don't buy the efficiency argument. The main
> design points of the IOVA allocator are to manage large address spaces
> while trying to maximise spatial locality to minimise the underlying
> pagetable usage, and allocating with a flexible limit to support
> multiple devices with different addressing capabilities in the same
> address space. If none of those aspects are relevant to the use-case -
> which AFAICS appears to be true here - then as a general-purpose
> resource allocator it's rubbish and has an unreasonably massive memory
> overhead and there are many, many better choices.
>

OK, I get your point. Actually, we used the genpool allocator in an
earlier version. Maybe we can fall back to using it.
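
Roughly, the genpool-based version would look something like this
(an illustrative sketch only; the example_* names and the granularity
are not taken from the earlier patches):

#include <linux/genalloc.h>

static struct gen_pool *example_iova_pool;

/* carve the whole bounce range into a page-granular pool */
static int example_iova_pool_init(unsigned long base, size_t size)
{
	example_iova_pool = gen_pool_create(PAGE_SHIFT, -1);
	if (!example_iova_pool)
		return -ENOMEM;

	return gen_pool_add(example_iova_pool, base, size, -1);
}

static unsigned long example_iova_alloc(size_t size)
{
	/* returns 0 on failure */
	return gen_pool_alloc(example_iova_pool, size);
}

static void example_iova_free(unsigned long iova, size_t size)
{
	gen_pool_free(example_iova_pool, iova, size);
}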

> FWIW I've recently started thinking about moving all the caching stuff
> out of iova_domain and into the iommu-dma layer since it's now a giant
> waste of space for all the other current IOVA users.
>
> >> Presumably userspace doesn't have any concern about alignment and the
> >> things we have to worry about for the DMA API in general, so it's pretty
> >> much just allocating slots in a buffer, and there are far more effective
> >> ways to do that than a full-blown address space manager.
> >
> > Considering iova allocation efficiency, I think the iova allocator is
> > better here. In most cases, we don't even need to hold a spin lock
> > during iova allocation.
> >
> >> If you're going
> >> to reuse any infrastructure I'd have expected it to be SWIOTLB rather
> >> than the IOVA allocator. Because, y'know, you're *literally implementing
> >> a software I/O TLB* ;)
> >>
> >
> > But actually what we can reuse in SWIOTLB is the IOVA allocator.
>
> Huh? Those are completely unrelated and orthogonal things - SWIOTLB does
> not use an external allocator (see find_slots()). By SWIOTLB I mean
> specifically the library itself, not dma-direct or any of the other
> users built around it. The functionality for managing slots in a buffer
> and bouncing data in and out can absolutely be reused - that's why users
> like the Xen and iommu-dma code *are* reusing it instead of open-coding
> their own versions.
>

I see. Actually, the slot management in SWIOTLB is what I meant by the
IOVA allocator.

> > And
> > the IOVA management in SWIOTLB is not what we want. For example,
> > SWIOTLB allocates and uses contiguous memory for bouncing, which is
> > not necessary in VDUSE case.
>
> alloc_iova() allocates a contiguous (in IOVA address) region of space.
> In vduse_domain_map_page() you use it to allocate a contiguous region of
> space from your bounce buffer. Can you clarify how that is fundamentally
> different from allocating a contiguous region of space from a bounce
> buffer? Nobody's saying the underlying implementation details of where
> the buffer itself comes from can't be tweaked.
>

I meant physically contiguous memory here. We can currently allocate
the bounce pages one by one rather than allocating a large chunk of
physically contiguous memory at once, which is not friendly to a
userspace device.
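
For example, something like the sketch below is enough on the kernel
side (illustrative only; the example_* struct is not the real VDUSE
bounce map):

#include <linux/gfp.h>
#include <linux/mm.h>

struct example_bounce_map {
	struct page *bounce_page;
};

/* populate the bounce region page by page, no order-N allocation */
static int example_populate_bounce(struct example_bounce_map *maps,
				   unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		maps[i].bounce_page = alloc_page(GFP_KERNEL);
		if (!maps[i].bounce_page)
			goto err;
	}
	return 0;
err:
	while (i--)
		__free_page(maps[i].bounce_page);
	return -ENOMEM;
}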

> > And VDUSE needs coherent mapping which is
> > not supported by the SWIOTLB. Besides, the SWIOTLB works in singleton
> > mode (designed for platform IOMMU) , but VDUSE is based on on-chip
> > IOMMU (supports multiple instances).
> That's not entirely true - the IOMMU bounce buffering scheme introduced
> in intel-iommu and now moved into the iommu-dma layer was already a step
> towards something conceptually similar. It does still rely on stealing
> the underlying pages from the global SWIOTLB pool at the moment, but the
> bouncing is effectively done in a per-IOMMU-domain context.
>
> The next step is currently queued in linux-next, wherein we can now have
> individual per-device SWIOTLB pools. In fact at that point I think you
> might actually be able to do your thing without implementing any special
> DMA ops at all - you'd need to set up a pool for your "device" with
> force_bounce set, then when you mmap() that to userspace, set up
> dev->dma_range_map to describe an offset from the physical address of
> the buffer to the userspace address, and I think dma-direct would be
> tricked into doing the right thing. It's a bit wacky, but it could stand
> to save a hell of a lot of bother.
>

Cool! I missed this work, sorry. But it looks like its current version
can't meet our needs (e.g. avoiding physically contiguous memory). So
I'd like to treat it as a follow-up optimization and use a general
IOVA allocator in this initial version. The IOVA allocator would still
be needed for coherent mappings (vduse_domain_alloc_coherent() and
vduse_domain_free_coherent()) even after we reuse the SWIOTLB.
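
For the coherent path, the usage would be roughly the following (an
illustrative sketch under an assumed 32-bit limit, not the actual
vduse_domain_alloc_coherent() code; the example_* names are invented):

#include <linux/dma-mapping.h>
#include <linux/iova.h>

static dma_addr_t example_alloc_iova(struct iova_domain *iovad, size_t size)
{
	unsigned long limit = DMA_BIT_MASK(32) >> iova_shift(iovad);
	unsigned long pfns = iova_align(iovad, size) >> iova_shift(iovad);
	unsigned long pfn;

	/* rcache-backed fast path; usually no spinlock is taken */
	pfn = alloc_iova_fast(iovad, pfns, limit, true);
	if (!pfn)
		return DMA_MAPPING_ERROR;

	return (dma_addr_t)pfn << iova_shift(iovad);
}

static void example_free_iova(struct iova_domain *iovad,
			      dma_addr_t iova, size_t size)
{
	free_iova_fast(iovad, iova_pfn(iovad, iova),
		       iova_align(iovad, size) >> iova_shift(iovad));
}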

> Finally, enhancing SWIOTLB to cope with virtually-mapped buffers that
> don't have to be physically contiguous is a future improvement which I
> think could benefit various use-cases - indeed it's possibly already on
> the table for IOMMU bounce pages - so would probably be welcome in general.
>

Yes, it's indeed needed by VDUSE, but I'm not sure whether other
drivers would need it. It looks like we would need
swiotlb_tbl_map_single() to return a virtual address and introduce
some way to let the caller translate between VA and PA.

Thanks,
Yongji

