From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eduard - Gabriel Munteanu Subject: Re: [PATCH 2/7] pci: memory access API and IOMMU support Date: Thu, 2 Sep 2010 11:40:58 +0300 Message-ID: <20100902084058.GA7211@localhost> References: <1283007298-10942-1-git-send-email-eduard.munteanu@linux360.ro> <1283007298-10942-3-git-send-email-eduard.munteanu@linux360.ro> <20100902052826.GB4273@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: joro@8bytes.org, blauwirbel@gmail.com, paul@codesourcery.com, avi@redhat.com, anthony@codemonkey.ws, av1474@comtv.ru, yamahata@valinux.co.jp, kvm@vger.kernel.org, qemu-devel@nongnu.org To: "Michael S. Tsirkin" Return-path: Received: from mail-fx0-f46.google.com ([209.85.161.46]:44078 "EHLO mail-fx0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750830Ab0IBImq (ORCPT ); Thu, 2 Sep 2010 04:42:46 -0400 Received: by fxm13 with SMTP id 13so48354fxm.19 for ; Thu, 02 Sep 2010 01:42:44 -0700 (PDT) Content-Disposition: inline In-Reply-To: <20100902052826.GB4273@redhat.com> Sender: kvm-owner@vger.kernel.org List-ID: On Thu, Sep 02, 2010 at 08:28:26AM +0300, Michael S. Tsirkin wrote: > On Sat, Aug 28, 2010 at 05:54:53PM +0300, Eduard - Gabriel Munteanu wrote: > > PCI devices should access memory through pci_memory_*() instead of > > cpu_physical_memory_*(). This also provides support for translation and > > access checking in case an IOMMU is emulated. > > > > Memory maps are treated as remote IOTLBs (that is, translation caches > > belonging to the IOMMU-aware device itself). Clients (devices) must > > provide callbacks for map invalidation in case these maps are > > persistent beyond the current I/O context, e.g. AIO DMA transfers. > > > > Signed-off-by: Eduard - Gabriel Munteanu > > > I am concerned about adding more pointer chaising on data path. > Could we have > 1. 
an iommu pointer in a device, inherited by secondary buses > when they are created and by devices from buses when they are attached. > 2. translation pointer in the iommu instead of the bus The first solution I proposed was based on qdev, that is, each DeviceState had an 'iommu' field. Translation would be done by recursively looking in the parent bus/devs for an IOMMU. But Anthony said we're better off with bus-specific APIs, mostly because (IIRC) there may be different types of addresses and it might be difficult to abstract those properly. I suppose I could revisit the idea by integrating the IOMMU in a PCIDevice as opposed to a DeviceState. Anthony, Paul, any thoughts on this? > 3. pci_memory_XX functions inline, doing fast path for non-iommu case: > > if (__builtin_expect(!dev->iommu, 1) > return cpu_memory_rw But isn't this some sort of 'if (likely(!dev->iommu))' from the Linux kernel? If so, it puts the IOMMU-enabled case at a disadvantage. I suppose most emulated systems would have at least some theoretical reasons to enable the IOMMU, e.g. as a GART replacement (say for 32-bit PCI devices) or for userspace drivers. So there are reasons to enable the IOMMU even when you don't have a real host IOMMU and you're not using nested guests. > > --- > > hw/pci.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++++++- > > hw/pci.h | 74 +++++++++++++++++++++ > > hw/pci_internals.h | 12 ++++ > > qemu-common.h | 1 + > > 4 files changed, 271 insertions(+), 1 deletions(-) > > Almost nothing here is PCI specific. > Can this code go into dma.c/dma.h? > We would have struct DMADevice, APIs like device_dma_write etc. > This would help us get rid of the void * stuff as well? > Yeah, I know, that's similar to what I intended to do at first. Though I'm not sure that rids us of 'void *' stuff, quite on the contrary from what I've seen. 
Some stuff still needs to stay 'void *' (or an equivalent typedef, but still an opaque) simply because of the level of abstraction that's needed. [snip] > > +void pci_register_iommu(PCIDevice *iommu, > > + PCITranslateFunc *translate) > > +{ > > + iommu->bus->iommu = iommu; > > + iommu->bus->translate = translate; > > +} > > + > > The above seems broken for secondary buses, right? Also, can we use > qdev for initialization in some way, instead of adding more APIs? E.g. > I think it would be nice if we could just use qdev command line flags to > control which bus is behind iommu and which isn't. > > Each bus must have its own IOMMU. The secondary bus should ask the primary bus instead of going through cpu_physical_memory_*(). If that isn't the case, it's broken and the secondary bus must be converted to the new API just like regular devices. I'll have a look at that. > > +void pci_memory_rw(PCIDevice *dev, > > + pcibus_t addr, > > + uint8_t *buf, > > + pcibus_t len, > > + int is_write) > > +{ > > + int err; > > + unsigned perms; > > + PCIDevice *iommu = dev->bus->iommu; > > + target_phys_addr_t paddr, plen; > > + > > + perms = is_write ? IOMMU_PERM_WRITE : IOMMU_PERM_READ; > > + > > + while (len) { > > + err = dev->bus->translate(iommu, dev, addr, &paddr, &plen, perms); > > + if (err) > > + return; > > + > > + /* The translation might be valid for larger regions. 
*/ > > + if (plen > len) > > + plen = len; > > + > > + cpu_physical_memory_rw(paddr, buf, plen, is_write); > > + > > + len -= plen; > > + addr += plen; > > + buf += plen; > > + } > > +} > > + > > +static void pci_memory_register_map(PCIDevice *dev, > > + pcibus_t addr, > > + pcibus_t len, > > + target_phys_addr_t paddr, > > + PCIInvalidateMapFunc *invalidate, > > + void *invalidate_opaque) > > +{ > > + PCIMemoryMap *map; > > + > > + map = qemu_malloc(sizeof(PCIMemoryMap)); > > + map->addr = addr; > > + map->len = len; > > + map->paddr = paddr; > > + map->invalidate = invalidate; > > + map->invalidate_opaque = invalidate_opaque; > > + > > + QLIST_INSERT_HEAD(&dev->memory_maps, map, list); > > +} > > + > > +static void pci_memory_unregister_map(PCIDevice *dev, > > + target_phys_addr_t paddr, > > + target_phys_addr_t len) > > +{ > > + PCIMemoryMap *map; > > + > > + QLIST_FOREACH(map, &dev->memory_maps, list) { > > + if (map->paddr == paddr && map->len == len) { > > + QLIST_REMOVE(map, list); > > + free(map); > > + } > > + } > > +} > > + > > +void pci_memory_invalidate_range(PCIDevice *dev, > > + pcibus_t addr, > > + pcibus_t len) > > +{ > > + PCIMemoryMap *map; > > + > > + QLIST_FOREACH(map, &dev->memory_maps, list) { > > + if (ranges_overlap(addr, len, map->addr, map->len)) { > > + map->invalidate(map->invalidate_opaque); > > + QLIST_REMOVE(map, list); > > + free(map); > > + } > > + } > > +} > > + > > +void *pci_memory_map(PCIDevice *dev, > > + PCIInvalidateMapFunc *cb, > > + void *opaque, > > + pcibus_t addr, > > + target_phys_addr_t *len, > > + int is_write) > > +{ > > + int err; > > + unsigned perms; > > + PCIDevice *iommu = dev->bus->iommu; > > + target_phys_addr_t paddr, plen; > > + > > + perms = is_write ? 
IOMMU_PERM_WRITE : IOMMU_PERM_READ; > > + > > + plen = *len; > > + err = dev->bus->translate(iommu, dev, addr, &paddr, &plen, perms); > > + if (err) > > + return NULL; > > + > > + /* > > + * If this is true, the virtual region is contiguous, > > + * but the translated physical region isn't. We just > > + * clamp *len, much like cpu_physical_memory_map() does. > > + */ > > + if (plen < *len) > > + *len = plen; > > + > > + /* We treat maps as remote TLBs to cope with stuff like AIO. */ > > + if (cb) > > + pci_memory_register_map(dev, addr, *len, paddr, cb, opaque); > > + > > + return cpu_physical_memory_map(paddr, len, is_write); > > +} > > + > > All the above is really only useful for when there is an iommu, > right? So maybe we should shortcut all this if there's no iommu? > Some people (e.g. Blue) suggested I shouldn't make the IOMMU emulation a compile-time option, like I originally did. And I'm not sure any runtime "optimization" (as in likely()/unlikely()) is justified. [snip] > > diff --git a/hw/pci_internals.h b/hw/pci_internals.h > > index e3c93a3..fb134b9 100644 > > --- a/hw/pci_internals.h > > +++ b/hw/pci_internals.h > > @@ -33,6 +33,9 @@ struct PCIBus { > > Keep a count of the number of devices with raised IRQs. */ > > int nirq; > > int *irq_count; > > + > > + PCIDevice *iommu; > > + PCITranslateFunc *translate; > > }; > > Why is translate pointer in a bus? I think it's a work of an iommu? > Anthony and Paul thought it's best to simply ask the parent bus for translation. I somewhat agree with that: devices that aren't IOMMU-aware simply attempt to do PCI requests to memory and the IOMMU translates and checks them transparently. 
> > struct PCIBridge { > > @@ -44,4 +47,13 @@ struct PCIBridge { > > const char *bus_name; > > }; > > > > +struct PCIMemoryMap { > > + pcibus_t addr; > > + pcibus_t len; > > + target_phys_addr_t paddr; > > + PCIInvalidateMapFunc *invalidate; > > + void *invalidate_opaque; > > Can we have a structure that encapsulates the mapping > data instead of a void *? > > Not really. 'invalidate_opaque' belongs to device code. It's meant to be a handle to easily identify the mapping. For example, DMA code wants to cancel AIO transfers when the bus requests the map to be invalidated. It's difficult to look that AIO transfer up using non-opaque data. > > + QLIST_ENTRY(PCIMemoryMap) list; > > +}; > > + > > #endif /* QEMU_PCI_INTERNALS_H */ > > diff --git a/qemu-common.h b/qemu-common.h > > index d735235..8b060e8 100644 > > --- a/qemu-common.h > > +++ b/qemu-common.h > > @@ -218,6 +218,7 @@ typedef struct SMBusDevice SMBusDevice; > > typedef struct PCIHostState PCIHostState; > > typedef struct PCIExpressHost PCIExpressHost; > > typedef struct PCIBus PCIBus; > > +typedef struct PCIMemoryMap PCIMemoryMap; > > typedef struct PCIDevice PCIDevice; > > typedef struct PCIBridge PCIBridge; > > typedef struct SerialState SerialState; > > -- > > 1.7.1 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html