From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:58735) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1duCTI-0003gR-CD for qemu-devel@nongnu.org; Tue, 19 Sep 2017 02:57:29 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1duCTE-0003vx-82 for qemu-devel@nongnu.org; Tue, 19 Sep 2017 02:57:28 -0400 Received: from mail-it0-x244.google.com ([2607:f8b0:4001:c0b::244]:33729) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1duCTD-0003uq-W1 for qemu-devel@nongnu.org; Tue, 19 Sep 2017 02:57:24 -0400 Received: by mail-it0-x244.google.com with SMTP id g18so1878477itg.0 for ; Mon, 18 Sep 2017 23:57:23 -0700 (PDT) From: Alexey Kardashevskiy References: <20170918101709.30421-1-aik@ozlabs.ru> <20170918101709.30421-2-aik@ozlabs.ru> <2debed31-4c28-6dc3-b159-1fdc3fbf87f5@ozlabs.ru> Message-ID: Date: Tue, 19 Sep 2017 16:57:17 +1000 MIME-Version: 1.0 In-Reply-To: <2debed31-4c28-6dc3-b159-1fdc3fbf87f5@ozlabs.ru> Content-Type: text/plain; charset=utf-8 Content-Language: en-AU Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v3 01/13] memory: Postpone flatview and dispatch tree building till all devices are added List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini , qemu-devel@nongnu.org On 19/09/17 12:09, Alexey Kardashevskiy wrote: > On 19/09/17 00:54, Paolo Bonzini wrote: >> On 18/09/2017 12:16, Alexey Kardashevskiy wrote: >>> Most devices use at least one address space and every time a new address >>> space is added, flat views and dispatch trees are rebuild for all address >>> spaces. This is not a problem for a relatively small amount of devices but >>> even 50 virtio-pci devices use more than 8GB of RAM. >>> >>> What happens that on every flatview/dispatch rebuild, new arrays are >>> allocated and old ones release but the release is done via RCU so until >>> an entire machine is build, they are not released. >>> >>> This wraps devices creation into memory_region_transaction_begin/commit >>> to massively reduce amount of flat view/dispatch tree (re)allocations. >>> >>> Signed-off-by: Alexey Kardashevskiy >>> --- >>> Changes: >>> v2: >>> * wrapped qemu_run_machine_init_done_notifiers() as well >>> --- >>> vl.c | 9 +++++++++ >>> 1 file changed, 9 insertions(+) >>> >>> diff --git a/vl.c b/vl.c >>> index 9e62e92aea..e4f2ece590 100644 >>> --- a/vl.c >>> +++ b/vl.c >>> @@ -4741,12 +4741,16 @@ int main(int argc, char **argv, char **envp) >>> igd_gfx_passthru(); >>> >>> /* init generic devices */ >>> + memory_region_transaction_begin(); >>> + >>> rom_set_order_override(FW_CFG_ORDER_OVERRIDE_DEVICE); >>> if (qemu_opts_foreach(qemu_find_opts("device"), >>> device_init_func, NULL, NULL)) { >>> exit(1); >>> } >>> >>> + memory_region_transaction_commit(); >>> + >>> cpu_synchronize_all_post_init(); >>> >>> rom_reset_order_override(); >>> @@ -4829,8 +4833,13 @@ int main(int argc, char **argv, char **envp) >>> /* TODO: once all bus devices are qdevified, this should be done >>> * when bus is created by qdev.c */ >>> qemu_register_reset(qbus_reset_all_fn, sysbus_get_default()); >>> + >>> + memory_region_transaction_begin(); >>> + >>> qemu_run_machine_init_done_notifiers(); >>> >>> + memory_region_transaction_commit(); >>> + >>> if (rom_check_and_register_reset() != 0) { >>> error_report("rom check and register reset failed"); >>> exit(1); >>> >> >> This should not be necessary given the other patches; the PCI devices >> have an empty address space at the beginning, so there are other less >> intrusive optimizations to do instead with the same effect: >> >> 1) as a start, the "|= root->enabled" can resolve aliases. This should >> be enough for the PCI device case. >> >> 2) also, after patch 2 you know that the address space has no listeners >> here, so the begin/commit isn't really needed. Instead you can use the >> open-coded loop to directly generate the FlatView. This avoids touching >> _all_ address spaces, which is already an improvement from O(n^2) to >> O(n) rebuilds on device startup. 03/13 does this already, no? >> >> 3) you can consult the list (or hash table :)) of live FlatViews (which >> means you keep it live after memory_region_transaction_commit ends, and >> only clear it on the next call), and reuse an existing FlatView. Note >> that the number of distinct FlatViews should be very few, > > > I keep missing this bit - why few? Each virtio-pci device creates 2 AS, > with proxy->modern_bar and pci_dev->bus_master_container_region which are > unique and not aliases. Remember, 500 virtio devices is my test case ;) More details: pci_dev->bus_master_container_region is a root and it is enabled but its only child pci_dev->bus_master_enable_region is not. Ok, in flatview_topology_update() I can render a FV, see that it is empty (view->nr==0) and share an empty FV in this case too, this halves the number of FVs (from ~1000 to ~500 for 500 virtio devices). But proxy->modern_bar (which has an modern_cfg alias which is a root of an AS) is enabled since it is created and I could disable it and enable afterwards but since a PCI device enablement is done by writing to the config space, I kind of stuck here. I can do something like this and it helps a lot (now with -S I end up having 4 FVs and much better start time) but it is kinda hacky and "memory: Postpone flatview and dispatch tree building till all devices are added" solves this better imho, no? diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index 3268c16966..fa2cd7cf2c 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -629,6 +629,8 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address, VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus); struct virtio_pci_cfg_cap *cfg; + memory_region_set_enabled(&proxy->modern_bar, true); + pci_default_write_config(pci_dev, address, val, len); if (range_covers_byte(address, len, PCI_COMMAND) && @@ -662,6 +664,8 @@ static uint32_t virtio_read_config(PCIDevice *pci_dev, VirtIOPCIProxy *proxy = DO_UPCAST(VirtIOPCIProxy, pci_dev, pci_dev); struct virtio_pci_cfg_cap *cfg; + memory_region_set_enabled(&proxy->modern_bar, true); + if (proxy->config_cap && ranges_overlap(address, len, proxy->config_cap + offsetof(struct virtio_pci_cfg_cap, pci_cfg_data), @@ -1790,6 +1794,8 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp) 0, memory_region_size(&proxy->modern_bar)); + memory_region_set_enabled(&proxy->modern_bar, false); + address_space_init(&proxy->modern_as, &proxy->modern_cfg, "virtio-pci-cfg-as"); > > >> so feel free >> to revert from hash table to list in v4 if you prefer. >>> 4) you can skip address_space_update_topology_pass if >> QTAILQ_EMPTY(&as->listeners). This can provide some startup speed >> improvements. >> >> Optimizations 2/3/4 should be moved to the end of the series, or even in >> a separate post. The first can be done in the beginning too, as you prefer. > > > > -- Alexey