From: Alexey Kardashevskiy
Date: Thu, 24 Mar 2016 13:32:48 +1100
Message-ID: <56F351D0.7060104@ozlabs.ru>
In-Reply-To: <20160323061126.GW23586@voom.redhat.com>
References: <1458546426-26222-1-git-send-email-aik@ozlabs.ru> <1458546426-26222-19-git-send-email-aik@ozlabs.ru> <20160323021317.GQ23586@voom.redhat.com> <56F20D41.1060305@ozlabs.ru> <20160323061126.GW23586@voom.redhat.com>
Subject: Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
To: David Gibson
Cc: Alex Williamson, qemu-ppc@nongnu.org, qemu-devel@nongnu.org

On 03/23/2016 05:11 PM, David Gibson wrote:
> On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 01:13 PM, David Gibson wrote:
>>> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification which allows to have additional DMA window(s)
>>>>
>>>> This implements DDW for emulated and VFIO devices.
>>>> This reserves RTAS token numbers for DDW calls.
>>>>
>>>> This changes the TCE table migration descriptor to support dynamic
>>>> tables as from now on, PHB will create as many stub TCE table objects
>>>> as PHB can possibly support but not all of them might be initialized at
>>>> the time of migration because DDW might or might not be requested by
>>>> the guest.
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.5 machine and older disable it.
>>>>
>>>> This implements DDW for VFIO. The host kernel support is required.
>>>> This adds a "levels" property to PHB to control the number of levels
>>>> in the actual TCE table allocated by the host kernel, 0 is the default
>>>> value to tell QEMU to calculate the correct value. Current hardware
>>>> supports up to 5 levels.
>>>>
>>>> The existing linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   1 +
>>>>  hw/ppc/spapr.c              |   7 +-
>>>>  hw/ppc/spapr_pci.c          |  73 ++++++++---
>>>>  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  hw/vfio/common.c            |   5 -
>>>>  include/hw/pci-host/spapr.h |  13 ++
>>>>  include/hw/ppc/spapr.h      |  16 ++-
>>>>  trace-events                |   4 +
>>>>  8 files changed, 395 insertions(+), 24 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index c1ffc77..986b36f 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>  obj-y += spapr_pci_vfio.o
>>>>  endif
>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>  # PowerPC 4xx boards
>>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>  obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index d0bb423..ef4c637 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>   * pseries-2.5
>>>>   */
>>>>  #define SPAPR_COMPAT_2_5 \
>>>> -    HW_COMPAT_2_5
>>>> +    HW_COMPAT_2_5 \
>>>> +    {\
>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>> +        .property = "ddw",\
>>>> +        .value    = stringify(off),\
>>>> +    },
>>>>
>>>>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>>  {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index af99a36..3bb294a 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>>>      return buf;
>>>>  }
>>>>
>>>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>> -                                        uint32_t liobn,
>>>> -                                        uint32_t page_shift,
>>>> -                                        uint64_t window_addr,
>>>> -                                        uint64_t window_size,
>>>> -                                        Error **errp)
>>>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>> +                                 uint32_t liobn,
>>>> +                                 uint32_t page_shift,
>>>> +                                 uint64_t window_addr,
>>>> +                                 uint64_t window_size,
>>>> +                                 Error **errp)
>>>>  {
>>>>      sPAPRTCETable *tcet;
>>>>      uint32_t nb_table = window_size >> page_shift;
>>>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>          return;
>>>>      }
>>>>
>>>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>>>> +        error_setg(errp,
>>>> +                   "Attempt to use second window when DDW is disabled on PHB");
>>>> +        return;
>>>> +    }
>>>
>>> This should never happen unless something is wrong with the tests in
>>> the RTAS functions, yes?  In which case it should probably be an
>>> assert().
>>
>> This should not. But this is called from the RTAS caller so I'd really like
>> to have a message rather than assert() if that condition happens, here or in
>> rtas_ibm_create_pe_dma_window().
>
> It should only be called from RTAS if ddw is enabled though, yes?

From RTAS and from the PHB reset handler.

Well. I will get rid of spapr_phb_dma_window_enable/spapr_phb_dma_window_disable,
they are quite useless when I look at them now.
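With those helpers gone, the PHB reset path can drive the TCE tables directly,
roughly like this (only a sketch, the final shape may differ; it assumes
spapr_tce_table_disable() is available as the counterpart of the
spapr_tce_table_enable() used elsewhere in this series):

void spapr_phb_dma_reset(sPAPRPHBState *sphb)
{
    int i;
    sPAPRTCETable *tcet;

    /* Disable whatever windows the guest left enabled */
    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
        tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));

        if (tcet && tcet->enabled) {
            spapr_tce_table_disable(tcet);
        }
    }

    /* Re-create the default 32bit DMA window */
    tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, 0));
    spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                           sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
}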
>>
>>>
>>>>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>>>  }
>>>>
>>>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>  {
>>>>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>
>>>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>      }
>>>>
>>>>      /* DMA setup */
>>>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>>>> -    if (!tcet) {
>>>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>>>> -        return;
>>>> -    }
>>>>
>>>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>> -                                        spapr_tce_get_iommu(tcet), 0);
>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>>>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>>>> +        if (!tcet) {
>>>> +            error_setg(errp, "Creating window#%d failed for %s",
>>>> +                       i, sphb->dtbusname);
>>>> +            return;
>>>> +        }
>>>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>> +                                            spapr_tce_get_iommu(tcet), 0);
>>>> +    }
>>>>
>>>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>>>  }
>>>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>>>
>>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>  {
>>>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>>>      Error *local_err = NULL;
>>>> +    int i;
>>>>
>>>> -    if (tcet && tcet->enabled) {
>>>> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>> +
>>>> +        if (tcet && tcet->enabled) {
>>>> +            spapr_phb_dma_window_disable(sphb, liobn);
>>>> +        }
>>>>      }
>>>>
>>>>      /* Register default 32bit DMA window */
>>>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>>>      /* Default DMA window is 0..1GB */
>>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>>>> +                       SPAPR_PCI_DMA_MAX_WINDOWS),
>>>
>>> What will happen if the user tries to set 'windows' larger than
>>> SPAPR_PCI_DMA_MAX_WINDOWS?
>>
>> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
>> everywhere, missed that. Besides that, there will be support for more
>> windows, that's it. The host VFIO IOMMU driver will fail creating more
>> windows but this is expected. For emulated windows, there will be more
>> windows with no other consequences.
>
> Hmm.. is there actually a reason to have the windows property?  Would
> you be better off just using the compile time constant for now.
I am afraid it is going to be 2 DMA windows forever as the other DMA
tlb-ish facility coming does not use windows at all :)

>>
>>
>>
>>
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>
>>>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      uint32_t interrupt_map_mask[] = {
>>>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>> +    uint32_t ddw_applicable[] = {
>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>> +    };
>>>> +    uint32_t ddw_extensions[] = {
>>>> +        cpu_to_be32(1),
>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>> +    };
>>>>      sPAPRTCETable *tcet;
>>>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>      sPAPRFDT s_fdt;
>>>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>
>>>> +    /* Dynamic DMA window */
>>>> +    if (phb->ddw_enabled) {
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>> +                         sizeof(ddw_applicable)));
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>> +    }
>>>> +
>>>>      /* Build the interrupt-map, this must matches what is done
>>>>       * in pci_spapr_map_irq
>>>>       */
>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>> new file mode 100644
>>>> index 0000000..37f805f
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>> @@ -0,0 +1,300 @@
>>>> +/*
>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>> + *
>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License as published by
>>>> + * the Free Software Foundation; either version 2 of the License,
>>>> + * or (at your option) any later version.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>>> + * GNU General Public License for more details.
>>>> + *
>>>> + * You should have received a copy of the GNU General Public License
>>>> + * along with this program; if not, see .
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && tcet->enabled) {
>>>> +        ++*(unsigned *)opaque;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>> +{
>>>> +    unsigned ret = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && !tcet->enabled) {
>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>> +        return 1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>> +{
>>>> +    uint32_t liobn = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>> +
>>>> +    return liobn;
>>>> +}
>>>> +
>>>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>>>> +                                 uint64_t page_mask)
>>>> +{
>>>> +    int i, j;
>>>> +    uint32_t mask = 0;
>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>> +    };
>>>> +
>>>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>>>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>>>> +            if ((sps[i].page_shift == masks[j].shift) &&
>>>> +                (page_mask & (1ULL << masks[j].shift))) {
>>>> +                mask |= masks[j].mask;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return mask;
>>>> +}
>>>> +
>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    CPUPPCState *env = &cpu->env;
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid, max_window_size;
>>>> +    uint32_t avail, addr, pgmask = 0;
>>>> +
>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    /* Work out supported page masks */
>>>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>>>
>>> There are a few potential problems here.  First you're just
>>> arbitrarily picking the first entry in the sps array to filter
>>
>> Why first? spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
>
> env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
> sizes, then again for actual page sizes.  You're only examing the
> first "row" of that table.  It kinda works because the 4k base page
> size is first, which lists all the actual page size options.

Ah. Right. So I need to walk through all of them, ok.
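Something like this then (an untested sketch; it assumes the target-ppc layout
where every sps[i] carries an enc[] list of actual page sizes terminated by
page_shift == 0, and the caller would pass &env->sps instead of env->sps.sps):

static uint32_t spapr_query_mask(const struct ppc_segment_page_sizes *sps,
                                 uint64_t page_mask)
{
    int i, j, k;
    uint32_t mask = 0;
    const struct { int shift; uint32_t mask; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };

    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
        /* Walk the actual page sizes of every base page size, not just row 0 */
        for (j = 0; j < PPC_PAGE_SIZES_MAX_SZ; j++) {
            unsigned shift = sps->sps[i].enc[j].page_shift;

            if (!shift) {
                break;
            }
            for (k = 0; k < ARRAY_SIZE(masks); k++) {
                if (shift == masks[k].shift &&
                    (page_mask & (1ULL << shift))) {
                    mask |= masks[k].mask;
                }
            }
        }
    }

    return mask;
}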
>>> against, which doesn't seem right (except by accident).  It's a little
>>> bit odd filtering against guest page sizes at all, although I get what
>>> you're really trying to do is filter against allowed host page sizes.
>>>
>>> The other problem is that silently filtering capabilities based on the
>>> host can be a pain for migration - I've made the mistake and had it
>>> bite me in the past.  I think it would be safer to just check the
>>> pagesizes requested in the property against what's possible and
>>> outright fail if they don't match.  For convenience you could also set
>>> according to host capabilities if the user doesn't specify anything,
>>> but that would require explicit handling of the "default" case.

What are the host capabilities here? There is a page mask from the host
IOMMU/PE which is 4K|64K|16M and many other sizes; this is always supported
by IODA2. And there is PAGE_SIZE and huge pages (but only with -mempath),
so 64K and 16M (with -mempath). And there is a "ddw-query" RTAS call which
tells the guest whether it can use 16M or not.

How do you suggest I advertise 16M to the guest? If I always advertise 16M
and there is no -mempath, the guest won't try a smaller page size.

So - if the user wants 16M IOMMU pages, he has to use -mempath and in
addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, and
by default enable only 4K and 64K (or just 4K?)? I am fine with this, it
just means more work for the libvirt folks.

>>
>> For the migration purposes, both guests should be started with or without
>> hugepages enabled; this is taken into account already. Besides that, the
>> result of "query" won't differ.
>
> Hmm.. if you're migrating between TCG and KVM or between PR and HV
> these could change as well.  I'm not sure that works at the moment,
> but I'd prefer not to introduce any more barriers to it than we have
> to.
>
>>> Remember that this code will be relevant for DDW with emulated
>>> devices, even if VFIO is not in play at all.
>>>
>>> All those considerations aside, it seems like it would make more sense
>>> to do this filtering during device realize, rather than leaving it
>>> until the guest queries.
>>
>> The result will be the same, it only depends on whether hugepages are
>> enabled or not and this happens at the start time. But yes, feels more
>> accurate to do this in PHB realize(), I'll move it.
>>
>>>
>>>> +    /*
>>>> +     * This is "Largest contiguous block of TCEs allocated specifically
>>>> +     * for (that is, are reserved for) this PE".
>>>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>>>> +     */
>>>> +    max_window_size = MACHINE(spapr)->maxram_size >> SPAPR_TCE_PAGE_SHIFT;
>>>
>>> Will maxram_size always be enough?  There will sometimes be an
>>> alignment gap between the "base" RAM and the hotpluggable RAM, meaning
>>> that if everything is plugged the last RAM address will be beyond
>>> maxram_size.  Will that require pushing this number up, or will the
>>> guest "repack" the RAM layout when it maps it into the TCE tables?
>>
>> Hm. I do not know what the guest does to DDW on memory hotplug but this is
>> a valid point... What QEMU helper does return the last available address
>> in the system memory address space? Like memblock_end_of_DRAM() in the
>> kernel, I would use that instead.
>
> There is a last_ram_offset() but that's in the ram_addr_t address

What do you call "ram_addr_t address space"? Non-pluggable memory and
machine->ram_size is its size?

> space, which isn't necessarily the same as the physical address space
> (though it's usually similar).
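If ram_addr_t is not reliable for this, I could derive the limit from the
machine properties instead, something like the helper below (a rough sketch;
spapr->hotplug_memory is my guess at how the hotpluggable area will be
described, base plus a memory region, and it assumes machine->ram_size stays
fixed after hotplug):

static uint64_t spapr_dma_max_ram(sPAPRMachineState *spapr)
{
    MachineState *machine = MACHINE(spapr);

    if (machine->ram_size == machine->maxram_size) {
        /* No hotpluggable memory configured */
        return machine->ram_size;
    }
    /* Cover everything up to the end of the hotpluggable area */
    return spapr->hotplug_memory.base +
           memory_region_size(&spapr->hotplug_memory.mr);
}

and then in the query handler:

    max_window_size = spapr_dma_max_ram(spapr) >> SPAPR_TCE_PAGE_SHIFT;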
I looked at the code and it looks like machine->ram_size is always what
came from "-m" and it won't grow if some memory was hotplugged, is that
correct? And hotpluggable memory does not appear in the global ram_list as
a RAMBlock?

> You can have a look at what we check
> in (the TCG/PR version of) H_ENTER which needs to check this as well.

Ok, thanks for the pointer.


--
Alexey