From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jag Raman <jag.raman@oracle.com>
Cc: "eduardo@habkost.net" <eduardo@habkost.net>,
	"Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	"Beraldo Leal" <bleal@redhat.com>,
	"John Johnson" <john.g.johnson@oracle.com>,
	"quintela@redhat.com" <quintela@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"armbru@redhat.com" <armbru@redhat.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@gmail.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"thanos.makatos@nutanix.com" <thanos.makatos@nutanix.com>,
	"Eric Blake" <eblake@redhat.com>,
	"john.levon@nutanix.com" <john.levon@nutanix.com>,
	"Philippe Mathieu-Daudé" <f4bug@amsat.org>
Subject: Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
Date: Thu, 10 Feb 2022 17:53:12 -0500	[thread overview]
Message-ID: <20220210175040-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <9E989878-326F-4E72-85DD-34D1CB72F0F8@oracle.com>

On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
> 
> 
> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
> >> 
> >> 
> >>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>> 
> >>> On Wed, 2 Feb 2022 01:13:22 +0000
> >>> Jag Raman <jag.raman@oracle.com> wrote:
> >>> 
> >>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>> 
> >>>>> On Tue, 1 Feb 2022 21:24:08 +0000
> >>>>> Jag Raman <jag.raman@oracle.com> wrote:
> >>>>> 
> >>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
> >>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> 
> >>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> >>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>>> 
> >>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
> >>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
> >>>>>>>>>> 
> >>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>>>>>> processes from QEMU with shared memory access to RAM but no direct
> >>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>>>>>> requests.
> >>>>>>>>>> 
> >>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>>>>>> accesses.
> >>>>>>>>>> 
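
The fallback mentioned here is the explicit DMA read/write message pair in
the vfio-user protocol (VFIO_USER_DMA_READ / VFIO_USER_DMA_WRITE). A rough
server-side sketch with libvfio-user follows; the vfu_addr_to_sg() /
vfu_dma_read() names and signatures are assumptions from memory, not a
reference:

    /* Sketch only: DMA to a guest region the server could not mmap()
     * falls back to protocol messages instead of direct load/store.
     * Names and signatures below are assumptions, not verified. */
    dma_sg_t sg;
    if (vfu_addr_to_sg(vfu_ctx, (vfu_dma_addr_t)iova, len,
                       &sg, 1, PROT_READ) == 1) {
        /* goes out as a VFIO_USER_DMA_READ message on the wire */
        vfu_dma_read(vfu_ctx, &sg, buf);
    }
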
> >>>>>>>>>>> In
> >>>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>>>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>>>>>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>>>>>>>> bus be in a separate root complex?  Thanks,        
> >>>>>>>>>> 
> >>>>>>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
> >>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>>>>>> BARs to the same address.
> >>>>>>>>>> 
> >>>>>>>>>> I think the separate root complex idea is a good solution. This
> >>>>>>>>>> patch series takes a different approach by adding the concept of
> >>>>>>>>>> isolated address spaces into hw/pci/.      
> >>>>>>>>> 
> >>>>>>>>> This all still seems pretty sketchy. BARs cannot overlap within the
> >>>>>>>>> same vCPU address space, except perhaps while they're being sized,
> >>>>>>>>> but DMA should be disabled during sizing.
> >>>>>>>>> 
> >>>>>>>>> Devices within the same VM context with identical BARs would need to
> >>>>>>>>> operate in different address spaces.  For example, a translation offset
> >>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>>>>>> perhaps using the translation offset bits to address a root complex and
> >>>>>>>>> masking those bits for downstream transactions.
> >>>>>>>>> 
> >>>>>>>>> In general, the device simply operates in an address space, ie. an
> >>>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
> >>>>>>>>> translation as necessary to generate a guest physical address.  The
> >>>>>>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>>>>>> there is no requirement or expectation for it to be globally unique.
> >>>>>>>>> 
> >>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >>>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
> >>>>>>>> 
> >>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> >>>>>>>> 
> >>>>>>>> The issue is that there can be as many guest physical address spaces as
> >>>>>>>> there are vfio-user clients connected, so per-client isolated address
> >>>>>>>> spaces are required. This patch series has a solution to that problem
> >>>>>>>> with the new pci_isol_as_mem/io() API.    
> >>>>>>> 
> >>>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> >>>>>>> devices across many VMs (I'm not sure if you're referring to the device
> >>>>>>> or the VM as a client) needs to deal with different address spaces per
> >>>>>>> device.  The server needs to be able to uniquely identify every DMA,
> >>>>>>> which must be part of the interface protocol.  But I don't see how that
> >>>>>>> imposes a requirement of an isolated address space.  If we want the
> >>>>>>> device isolated because we don't trust the server, that's where an IOMMU
> >>>>>>> provides per device isolation.  What is the restriction of the
> >>>>>>> per-client isolated address space and why do we need it?  The server
> >>>>>>> needing to support multiple clients is not a sufficient answer to
> >>>>>>> impose new PCI bus types with an implicit restriction on the VM.    
> >>>>>> 
> >>>>>> Hi Alex,
> >>>>>> 
> >>>>>> I believe there are two separate problems with running PCI devices in
> >>>>>> the vfio-user server. The first concerns memory isolation and the
> >>>>>> second concerns vectoring of BAR accesses (as explained below).
> >>>>>> 
> >>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
> >>>>>> spaces. But we still had trouble with the vectoring. So we implemented
> >>>>>> separate address spaces for each PCIBus to tackle both problems
> >>>>>> simultaneously, based on the feedback we got.
> >>>>>> 
> >>>>>> The following gives an overview of issues concerning vectoring of
> >>>>>> BAR accesses.
> >>>>>> 
> >>>>>> The device’s BAR regions are mapped into the guest physical address
> >>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
> >>>>>> registers. To access the BAR regions of the device, QEMU uses
> >>>>>> address_space_rw() which vectors the physical address access to the
> >>>>>> device BAR region handlers.  
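
To make the vectoring concrete, here is a minimal sketch of that dispatch
(plain QEMU memory API of this era; bar_gpa and val are placeholders, not
code from the series): a write to a guest physical address that falls inside
a mapped BAR is routed by the memory core to that BAR's MemoryRegionOps.

    /* needs "exec/memory.h"; sketch only */
    uint32_t val = 0x1;
    /* bar_gpa: a guest physical address inside a mapped BAR */
    address_space_rw(&address_space_memory, bar_gpa,
                     MEMTXATTRS_UNSPECIFIED, &val, sizeof(val),
                     true /* is_write */);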
> >>>>> 
> >>>>> The guest physical address written to the BAR is irrelevant from the
> >>>>> device perspective, this only serves to assign the BAR an offset within
> >>>>> the address_space_mem, which is used by the vCPU (and possibly other
> >>>>> devices depending on their address space).  There is no reason for the
> >>>>> device itself to care about this address.  
> >>>> 
> >>>> Thank you for the explanation, Alex!
> >>>> 
> >>>> The confusion on my part is whether we are inside the device already when
> >>>> the server receives a request to access BAR region of a device. Based on
> >>>> your explanation, I get that your view is the BAR access request has
> >>>> propagated into the device already, whereas I was under the impression
> >>>> that the request is still on the CPU side of the PCI root complex.
> >>> 
> >>> If you are getting an access through your MemoryRegionOps, all the
> >>> translations have been made, you simply need to use the hwaddr as the
> >>> offset into the MemoryRegion for the access.  Perform the read/write to
> >>> your device, no further translations required.
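
In other words, the handler only ever sees a BAR-relative offset. A minimal
sketch (MyDevState and regs[] are hypothetical names, not from the series):

    static uint64_t mydev_bar_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyDevState *s = opaque;   /* hypothetical device state */
        /* 'addr' is already the offset into this BAR's MemoryRegion;
         * no further translation is needed. */
        return s->regs[addr >> 2];
    }

    static void mydev_bar_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
    {
        MyDevState *s = opaque;
        s->regs[addr >> 2] = val;
    }

    static const MemoryRegionOps mydev_bar_ops = {
        .read = mydev_bar_read,
        .write = mydev_bar_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };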
> >>> 
> >>>> Your view makes sense to me - once the BAR access request reaches the
> >>>> client (on the other side), we could consider that the request has reached
> >>>> the device.
> >>>> 
> >>>> On a separate note, if devices don’t care about the values in BAR
> >>>> registers, why do the default PCI config handlers intercept and map
> >>>> the BAR region into address_space_mem?
> >>>> (pci_default_write_config() -> pci_update_mappings())
> >>> 
> >>> This is the part that's actually placing the BAR MemoryRegion as a
> >>> sub-region into the vCPU address space.  I think if you track it,
> >>> you'll see PCIDevice.io_regions[i].address_space is actually
> >>> system_memory, which is used to initialize address_space_memory.
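
For illustration, the usual pattern in a device's realize function: the
device only registers its MemoryRegion; where the guest later places the BAR
is handled entirely by pci_update_mappings(). (mydev_bar_ops and MyDevState
are the hypothetical names from the sketch above.)

    static void mydev_realize(PCIDevice *pdev, Error **errp)
    {
        MyDevState *s = MYDEV(pdev);   /* hypothetical QOM cast macro */

        memory_region_init_io(&s->bar0, OBJECT(pdev), &mydev_bar_ops, s,
                              "mydev-bar0", 0x1000);
        /* The guest-assigned BAR address never appears here. */
        pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar0);
    }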
> >>> 
> >>> The machine assembles PCI devices onto buses as instructed by the
> >>> command line or hot plug operations.  It's the responsibility of the
> >>> guest firmware and guest OS to probe those devices, size the BARs, and
> >>> place the BARs into the memory hierarchy of the PCI bus, ie. system
> >>> memory.  The BARs are necessarily in the "guest physical memory" for
> >>> vCPU access, but it's essentially only coincidental that PCI devices
> >>> might be in an address space that provides a mapping to their own BAR.
> >>> There's no reason to ever use it.
> >>> 
> >>> In the vIOMMU case, we can't know that the device address space
> >>> includes those BAR mappings or if they do, that they're identity mapped
> >>> to the physical address.  Devices really need to not infer anything
> >>> about an address.  Think about real hardware, a device is told by
> >>> driver programming to perform a DMA operation.  The device doesn't know
> >>> the target of that operation, it's the guest driver's responsibility to
> >>> make sure the IOVA within the device address space is valid and maps to
> >>> the desired target.  Thanks,
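
Concretely, an emulated device in QEMU issues DMA through its own address
space, so any vIOMMU in front of it is applied automatically. A minimal
sketch (pdev, iova and buf are placeholders):

    uint8_t buf[512];
    /* 'iova' comes from guest driver programming; it is only meaningful
     * within this device's address space. */
    pci_dma_read(pdev, iova, buf, sizeof(buf));

    /* Equivalent, spelling out the per-device address space: */
    AddressSpace *as = pci_get_address_space(pdev);
    address_space_read(as, iova, MEMTXATTRS_UNSPECIFIED, buf, sizeof(buf));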
> >> 
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >> 
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >> 
> >> Devices such as “name” and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver asks the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when an IOMMU is enabled. Specifically, the driver asks
> >> the device to access other BAR regions using the BAR addresses programmed
> >> into the PCI config space. This happens even without the vfio-user patches:
> >> we can reproduce it by enabling the IOMMU with the “-device intel-iommu”
> >> QEMU option and adding “intel_iommu=on iommu=nopt” to the kernel command
> >> line, at which point we see an IOMMU fault.
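
For reference, the reproduction described above boils down to something like
this (the q35 machine type is an assumption here, since intel-iommu requires
it; only the -device intel-iommu option and the kernel parameters come from
the report):

    # host side (illustrative)
    qemu-system-x86_64 -M q35 -device intel-iommu \
        -device lsi53c895a,id=scsi0 ...

    # guest kernel command line
    intel_iommu=on iommu=nopt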
> > 
> > So, a device accessing its own BAR is different. Basically, these
> > transactions never go out on the bus at all, never mind reach the IOMMU.
> 
> Hi Michael,
> 
> In the LSI case, I did notice that it went to the IOMMU.

Hmm, do you mean you analyzed how a physical device works?
Or do you mean in QEMU?

> The device is reading the BAR
> address as if it was a DMA address.

I got that. My understanding was that a PCI device cannot be both
the master and the target of the same transaction, though I could
not find this in the spec - maybe I remember incorrectly.

> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> > 
> > 
> >> Unfortunately, we started our project with the LSI device, which led to all the
> >> confusion about what is expected at the server end in terms of
> >> vectoring/address-translation. It gave the impression that the request was still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >> 
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> >> device’s BAR into the IOVA space, at the same CPU VA programmed into the BAR
> >> registers? This would help devices such as LSI circumvent the problem. One
> >> problem with this approach is that it has the potential to collide with another
> >> legitimate IOVA address. Kindly share your thoughts on this.
> >> 
> >> Thank you!
> > 
> > I am not 100% sure what you plan to do, but it sounds fine: even if
> > it collides, a traditional PCI device must never initiate cycles
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> Thank you!
> --
> Jag
> 
> > within its own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> > 
> > As was mentioned elsewhere in this thread, devices accessing each
> > other's BARs is a different matter.
> > 
> > I do not remember which rules apply to multiple functions of a
> > multi-function device, though. I think in traditional PCI they will
> > never go out on the bus, but with e.g. SR-IOV they probably would
> > go out? Alex, any idea?
> > 
> > 
> >> --
> >> Jag
> >> 
> >>> 
> >>> Alex
> >>> 
> >> 
> > 
> 




Thread overview: 99+ messages
2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
2022-01-20 13:27   ` Paolo Bonzini
2022-01-20 15:21     ` Jag Raman
2022-02-17  6:10     ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
2022-01-25  9:40   ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
2022-01-20  0:12   ` Michael S. Tsirkin
2022-01-20 15:20     ` Jag Raman
2022-01-25 18:38       ` Dr. David Alan Gilbert
2022-01-26  5:27         ` Jag Raman
2022-01-26  9:45           ` Stefan Hajnoczi
2022-01-26 20:07             ` Dr. David Alan Gilbert
2022-01-26 21:13               ` Michael S. Tsirkin
2022-01-27  8:30                 ` Stefan Hajnoczi
2022-01-27 12:50                   ` Michael S. Tsirkin
2022-01-27 21:22                   ` Alex Williamson
2022-01-28  8:19                     ` Stefan Hajnoczi
2022-01-28  9:18                     ` Stefan Hajnoczi
2022-01-31 16:16                       ` Alex Williamson
2022-02-01  9:30                         ` Stefan Hajnoczi
2022-02-01 15:24                           ` Alex Williamson
2022-02-01 21:24                             ` Jag Raman
2022-02-01 22:47                               ` Alex Williamson
2022-02-02  1:13                                 ` Jag Raman
2022-02-02  5:34                                   ` Alex Williamson
2022-02-02  9:22                                     ` Stefan Hajnoczi
2022-02-10  0:08                                     ` Jag Raman
2022-02-10  8:02                                       ` Michael S. Tsirkin
2022-02-10 22:23                                         ` Jag Raman
2022-02-10 22:53                                           ` Michael S. Tsirkin [this message]
2022-02-10 23:46                                             ` Jag Raman
2022-02-10 23:17                                           ` Alex Williamson
2022-02-10 23:28                                             ` Michael S. Tsirkin
2022-02-10 23:49                                               ` Alex Williamson
2022-02-11  0:26                                                 ` Michael S. Tsirkin
2022-02-11  0:54                                                   ` Jag Raman
2022-02-11  0:10                                             ` Jag Raman
2022-02-02  9:30                                 ` Peter Maydell
2022-02-02 10:06                                   ` Michael S. Tsirkin
2022-02-02 15:49                                     ` Alex Williamson
2022-02-02 16:53                                       ` Michael S. Tsirkin
2022-02-02 17:12                                   ` Alex Williamson
2022-02-01 10:42                     ` Dr. David Alan Gilbert
2022-01-26 18:13           ` Dr. David Alan Gilbert
2022-01-27 17:43             ` Jag Raman
2022-01-25  9:56   ` Stefan Hajnoczi
2022-01-25 13:49     ` Jag Raman
2022-01-25 14:19       ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 04/18] pci: create and free isolated PCI buses Jagannathan Raman
2022-01-25 10:25   ` Stefan Hajnoczi
2022-01-25 14:10     ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 05/18] qdev: unplug blocker for devices Jagannathan Raman
2022-01-25 10:27   ` Stefan Hajnoczi
2022-01-25 14:43     ` Jag Raman
2022-01-26  9:32       ` Stefan Hajnoczi
2022-01-26 15:13         ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine Jagannathan Raman
2022-01-25 10:32   ` Stefan Hajnoczi
2022-01-25 18:12     ` Jag Raman
2022-01-26  9:35       ` Stefan Hajnoczi
2022-01-26 15:20         ` Jag Raman
2022-01-26 15:43           ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 07/18] vfio-user: set qdev bus callbacks " Jagannathan Raman
2022-01-25 10:44   ` Stefan Hajnoczi
2022-01-25 21:12     ` Jag Raman
2022-01-26  9:37       ` Stefan Hajnoczi
2022-01-26 15:51         ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 08/18] vfio-user: build library Jagannathan Raman
2022-01-19 21:41 ` [PATCH v5 09/18] vfio-user: define vfio-user-server object Jagannathan Raman
2022-01-25 14:40   ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 10/18] vfio-user: instantiate vfio-user context Jagannathan Raman
2022-01-25 14:44   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 11/18] vfio-user: find and init PCI device Jagannathan Raman
2022-01-25 14:48   ` Stefan Hajnoczi
2022-01-26  3:14     ` Jag Raman
2022-01-19 21:42 ` [PATCH v5 12/18] vfio-user: run vfio-user context Jagannathan Raman
2022-01-25 15:10   ` Stefan Hajnoczi
2022-01-26  3:26     ` Jag Raman
2022-01-19 21:42 ` [PATCH v5 13/18] vfio-user: handle PCI config space accesses Jagannathan Raman
2022-01-25 15:13   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 14/18] vfio-user: handle DMA mappings Jagannathan Raman
2022-01-19 21:42 ` [PATCH v5 15/18] vfio-user: handle PCI BAR accesses Jagannathan Raman
2022-01-19 21:42 ` [PATCH v5 16/18] vfio-user: handle device interrupts Jagannathan Raman
2022-01-25 15:25   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 17/18] vfio-user: register handlers to facilitate migration Jagannathan Raman
2022-01-25 15:48   ` Stefan Hajnoczi
2022-01-27 17:04     ` Jag Raman
2022-01-28  8:29       ` Stefan Hajnoczi
2022-01-28 14:49         ` Thanos Makatos
2022-02-01  3:49         ` Jag Raman
2022-02-01  9:37           ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 18/18] vfio-user: avocado tests for vfio-user Jagannathan Raman
2022-01-26  4:25   ` Philippe Mathieu-Daudé via
2022-01-26 15:12     ` Jag Raman
2022-01-25 16:00 ` [PATCH v5 00/18] vfio-user server in QEMU Stefan Hajnoczi
2022-01-26  5:04   ` Jag Raman
2022-01-26  9:56     ` Stefan Hajnoczi
