* [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
@ 2015-01-14 0:56 Alexander Graf
2015-01-14 7:37 ` Paolo Bonzini
2015-02-12 15:38 ` Stefan Hajnoczi
0 siblings, 2 replies; 12+ messages in thread
From: Alexander Graf @ 2015-01-14 0:56 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi
On hosts with limited virtual address space (32bit pointers), we can very
easily run out of virtual memory with big thread pools.
Instead, we should limit ourselves to small pools to keep memory footprint
low on those systems.
This patch fixes random VM stalls like
(process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
on 32bit ARM systems for me.
Signed-off-by: Alexander Graf <agraf@suse.de>
---
thread-pool.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/thread-pool.c b/thread-pool.c
index e2cac8e..87a3ea9 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
qemu_mutex_init(&pool->lock);
qemu_cond_init(&pool->worker_stopped);
qemu_sem_init(&pool->sem, 0);
- pool->max_threads = 64;
+ if (sizeof(pool) == 4) {
+ /* 32bit systems run out of virtual memory quickly */
+ pool->max_threads = 4;
+ } else {
+ pool->max_threads = 64;
+ }
pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
QLIST_INIT(&pool->head);
--
1.7.12.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 0:56 [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts Alexander Graf
@ 2015-01-14 7:37 ` Paolo Bonzini
2015-01-14 10:20 ` Kevin Wolf
2015-02-12 15:38 ` Stefan Hajnoczi
1 sibling, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2015-01-14 7:37 UTC (permalink / raw)
To: Alexander Graf, Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi
On 14/01/2015 01:56, Alexander Graf wrote:
> + if (sizeof(pool) == 4) {
> + /* 32bit systems run out of virtual memory quickly */
> + pool->max_threads = 4;
> + } else {
> + pool->max_threads = 64;
> + }
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
The same problem applies to coroutine stacks, and those cannot be
throttled down as easily. But I guess if you limit the number of
threads, the guest gets slowed down and doesn't create as many coroutines.
It would be nice to have a way to measure coroutine stack usage, similar
to what the kernel does.
Paolo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 7:37 ` Paolo Bonzini
@ 2015-01-14 10:20 ` Kevin Wolf
2015-01-14 11:18 ` Paolo Bonzini
0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2015-01-14 10:20 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
Am 14.01.2015 um 08:37 hat Paolo Bonzini geschrieben:
>
>
> On 14/01/2015 01:56, Alexander Graf wrote:
> > + if (sizeof(pool) == 4) {
> > + /* 32bit systems run out of virtual memory quickly */
> > + pool->max_threads = 4;
> > + } else {
> > + pool->max_threads = 64;
> > + }
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> The same problem applies to coroutine stacks, and those cannot be
> throttled down as easily. But I guess if you limit the number of
> threads, the guest gets slowed down and doesn't create as many coroutines.
Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
coroutine is really a lot, and as I understand it, threads take even
more by default.
> It would be nice to have a way to measure coroutine stack usage, similar
> to what the kernel does.
The information which pages are mapped should be there somewhere...
Kevin
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 10:20 ` Kevin Wolf
@ 2015-01-14 11:18 ` Paolo Bonzini
2015-01-14 13:38 ` Kevin Wolf
2015-01-14 14:24 ` Markus Armbruster
0 siblings, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2015-01-14 11:18 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
On 14/01/2015 11:20, Kevin Wolf wrote:
>> > The same problem applies to coroutine stacks, and those cannot be
>> > throttled down as easily. But I guess if you limit the number of
>> > threads, the guest gets slowed down and doesn't create as many coroutines.
> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> coroutine is really a lot, and as I understand it, threads take even
> more by default.
Yup, 2 MB. Last time I proposed this, I think Markus was strongly in
the "better safe than sorry" camp. :)
But thread pool workers definitely don't need a big stack.
>> > It would be nice to have a way to measure coroutine stack usage, similar
>> > to what the kernel does.
> The information which pages are mapped should be there somewhere...
Yes, there is mincore(2). The complicated part is doing it fast, but
perhaps it doesn't need to be fast.
I tried gathering warning from GCC's -Wstack-usage=1023 option and the
block layer does not seem to have functions with huge stacks in the I/O
path.
So, assuming a maximum stack depth of 50 (already pretty generous since
there shouldn't be any recursive calls) a 100K stack should be pretty
much okay for coroutines and thread-pool threads.
That said there are some offenders in the device models. Other
QemuThreads, especially VCPU threads, had better stay with a big stack.
Paolo
1024-2048:
block/linux-aio.c ioq_submit stack usage is 1104 bytes
block/mirror.c mirror_run stack usage is 1280 bytes
block/qapi.c bdrv_query_image_info stack usage is 1152 bytes
block/vmdk.c vmdk_open_vmdk4 stack usage is 1840 bytes
block/vpc.c vpc_create stack usage is 1152 bytes
cpus.c qmp_memsave stack usage is 1120 bytes
cpus.c qmp_pmemsave stack usage is 1120 bytes
disas/sparc.c build_hash_table stack usage is 1072 bytes
fsdev/virtfs-proxy-helper.c main stack usage is 1264 bytes
hw/arm/highbank.c calxeda_init stack usage is 1200 bytes
hw/arm/stellaris.c stellaris_init stack usage is 1120 bytes
hw/arm/virt.c machvirt_init stack usage is 1264 bytes
hw/block/dataplane/virtio-blk.c handle_notify stack usage is 1632 bytes
hw/block/virtio-blk.c virtio_blk_dma_restart_bh stack usage is 1600 bytes
hw/block/virtio-blk.c virtio_blk_handle_output stack usage is 1584 bytes
hw/core/qdev.c qdev_get_legacy_property stack usage is 1104 bytes
hw/net/vmxnet_tx_pkt.c vmxnet_tx_pkt_do_sw_fragmentation stack usage is 1168 bytes
hw/ppc/e500.c ppce500_load_device_tree stack usage is 1072 bytes
hw/scsi/megasas.c megasas_dcmd_ld_get_list stack usage is 1152 bytes
hw/tpm/tpm_passthrough.c tpm_passthrough_test_tpmdev stack usage is 1216 bytes
linux-user/main.c main stack usage is 1664 bytes
linux-user/main.c main stack usage is 1696 bytes
linux-user/main.c main stack usage is 1728 bytes
linux-user/main.c main stack usage is 1744 bytes
linux-user/main.c main stack usage is 1760 bytes
linux-user/main.c main stack usage is 1776 bytes
linux-user/main.c main stack usage is 1856 bytes
linux-user/main.c main stack usage is 1872 bytes
linux-user/main.c main stack usage is 1888 bytes
linux-user/main.c main stack usage is 1952 bytes
linux-user/main.c main stack usage is 1984 bytes
linux-user/main.c main stack usage is 2032 bytes
migration/vmstate.c get_unused_buffer stack usage is 1088 bytes
monitor.c monitor_parse_command stack usage is 1424 bytes
monitor.c parse_cmdline stack usage is 1136 bytes
qemu-img.c img_rebase stack usage is 1248 bytes
qobject/json-parser.c parse_error stack usage is 1280 bytes
qobject/qjson.c to_json stack usage is 1136 bytes
qtest.c qtest_send stack usage is 1280 bytes
savevm.c do_savevm stack usage is 1424 bytes
slirp/slirp.c if_encap stack usage is 1680 bytes
target-i386/kvm.c kvm_get_apic stack usage is 1072 bytes
target-i386/kvm.c kvm_put_apic stack usage is 1072 bytes
target-xtensa/xtensa-semi.c helper_simcall stack usage is 1296 bytes
trace/control.c trace_init_events stack usage is 1152 bytes
ui/keymaps.c parse_keyboard_layout stack usage is 1152 bytes
ui/vnc.c addr_to_string stack usage is 1136 bytes
ui/vnc.c protocol_client_init stack usage is 1072 bytes
ui/vnc.c qmp_query_vnc stack usage is 1344 bytes
ui/vnc.c vnc_basic_info_get stack usage is 1136 bytes
ui/vnc-enc-tight.c tight_detect_smooth_image16 stack usage is 1184 bytes
ui/vnc-enc-tight.c tight_detect_smooth_image24 stack usage is 1152 bytes
ui/vnc-enc-tight.c tight_detect_smooth_image32 stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile15be stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile15le stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile16be stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile16le stack usage is 1264 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile24abe stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile24ale stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile24bbe stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile24ble stack usage is 1200 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile32be stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile32le stack usage is 1184 bytes
ui/vnc-enc-zrle-template.c zrle_encode_tile8ne stack usage is 1184 bytes
util/qemu-option.c opts_do_parse stack usage is 1280 bytes
util/qemu-option.c opts_parse stack usage is 1104 bytes
2048-4096:
block.c bdrv_swap stack usage is 3920 bytes
coroutine-ucontext.c qemu_coroutine_new stack usage is 2192 bytes
hw/arm/musicpal.c eth_send stack usage is 2160 bytes
hw/dma/etraxfs_dma.c channel_out_run stack usage is 2160 bytes
hw/net/cadence_gem.c gem_receive stack usage is 2192 bytes
hw/net/cadence_gem.c gem_transmit stack usage is 2176 bytes
hw/net/eepro100.c tx_command stack usage is 2736 bytes
hw/net/mcf_fec.c mcf_fec_do_tx stack usage is 2144 bytes
hw/ppc/spapr_pci.c spapr_populate_pci_dt stack usage is 4032 bytes
hw/scsi/megasas.c megasas_ctrl_get_info stack usage is 2176 bytes
libcacard/vcard_emul_nss.c vcard_emul_rsa_op stack usage is 2224 bytes
monitor.c file_completion stack usage is 3328 bytes
monitor.c host_net_remove_completion stack usage is 2176 bytes
monitor.c netdev_del_completion stack usage is 2112 bytes
monitor.c set_link_completion stack usage is 2112 bytes
qemu-char.c tcp_chr_connect stack usage is 2608 bytes
qemu-nbd.c main stack usage is 2336 bytes
slirp/misc.c fork_exec stack usage is 2176 bytes
slirp/socket.c sosendoob stack usage is 2144 bytes
target-arm/helper.c register_cp_regs_for_features stack usage is 3232 bytes
target-arm/helper.c register_cp_regs_for_features stack usage is 3360 bytes
target-i386/kvm.c kvm_get_msrs stack usage is 2448 bytes
target-i386/kvm.c kvm_put_msrs stack usage is 2448 bytes
ui/sdl.c sdl_update_caption stack usage is 2112 bytes
util/qemu-config.c qemu_config_parse stack usage is 2416 bytes
4096-8192:
block.c bdrv_commit stack usage is 4208 bytes
block/raw-posix.c hdev_open stack usage is 4320 bytes
block/vmdk.c vmdk_parse_extents stack usage is 4848 bytes
block/vvfat.c check_directory_consistency stack usage is 5152 bytes
gdbstub.c gdb_monitor_output stack usage is 4144 bytes
hw/arm/boot.c set_kernel_args stack usage is 4176 bytes
hw/audio/ac97.c read_audio stack usage is 4208 bytes
hw/audio/ac97.c write_audio stack usage is 4208 bytes
hw/audio/es1370.c es1370_transfer_audio stack usage is 4240 bytes
hw/audio/gus.c GUS_read_DMA stack usage is 4192 bytes
hw/audio/marvell_88w8618.c mv88w8618_audio_callback stack usage is 4160 bytes
hw/audio/milkymist-ac97.c ac97_in_cb stack usage is 4176 bytes
hw/audio/milkymist-ac97.c ac97_out_cb stack usage is 4176 bytes
hw/audio/sb16.c write_audio stack usage is 4192 bytes
hw/char/sclpconsole-lm.c process_mdb stack usage is 4144 bytes
hw/s390x/sclp.c sclp_service_call stack usage is 4336 bytes
hw/scsi/lsi53c895a.c lsi_memcpy stack usage is 4192 bytes
hw/scsi/megasas.c megasas_dcmd_cfg_read stack usage is 4224 bytes
hw/scsi/megasas.c megasas_dcmd_pd_get_list stack usage is 5808 bytes
hw/tpm/tpm_passthrough.c tpm_passthrough_open_sysfs_cancel stack usage is 4144 bytes
linux-user/elfload.c vma_dump_size stack usage is 4128 bytes
linux-user/syscall.c do_openat stack usage is 4176 bytes
net/tap.c net_bridge_run_helper stack usage is 4720 bytes
qemu-bridge-helper.c parse_acl_file stack usage is 4176 bytes
qemu-char.c fd_chr_read stack usage is 4160 bytes
qemu-char.c pty_chr_read stack usage is 4160 bytes
qemu-char.c qemu_chr_fe_printf stack usage is 4352 bytes
qemu-char.c qemu_chr_open_pty stack usage is 4176 bytes
qemu-char.c tcp_chr_read stack usage is 4192 bytes
qga/commands-posix.c get_pci_driver stack usage is 4144 bytes
qga/main.c channel_event_cb stack usage is 4160 bytes
target-i386/kvm.c kvm_arch_init_vcpu stack usage is 4144 bytes
target-s390x/ioinst.c ioinst_handle_chsc_scsc stack usage is 4112 bytes
target-s390x/misc_helper.c helper_stsi stack usage is 4128 bytes
ui/vnc-auth-sasl.c vnc_client_read_sasl stack usage is 4160 bytes
util/oslib-posix.c qemu_init_exec_dir stack usage is 4144 bytes
util/path.c init_paths stack usage is 4144 bytes
8192-16384:
block/vmdk.c vmdk_parent_open stack usage is 10288 bytes
block/vmdk.c vmdk_read_cid stack usage is 10304 bytes
gdbstub.c gdb_handle_packet stack usage is 8336 bytes
hw/acpi/core.c acpi_table_add stack usage is 8336 bytes
hw/audio/cs4231a.c cs_write_audio stack usage is 12384 bytes
hw/core/qdev-properties-system.c set_netdev stack usage is 8288 bytes
hw/i386/kvm/pci-assign.c assign_failed_examine stack usage is 12480 bytes
hw/i386/pc.c load_linux stack usage is 8336 bytes
hw/net/rtl8139.c rtl8139_transmit_one stack usage is 8256 bytes
hw/net/xgmac.c xgmac_enet_send stack usage is 8288 bytes
hw/s390x/s390-pci-inst.c clp_service_call stack usage is 8384 bytes
hw/sparc64/sun4u.c sun4u_NVRAM_set_params stack usage is 8272 bytes
hw/sparc/sun4m.c nvram_init stack usage is 8272 bytes
hw/vfio/pci.c vfio_initfn stack usage is 8592 bytes
linux-user/elfload.c elf_core_dump stack usage is 12640 bytes
linux-user/elfload.c elf_core_dump stack usage is 8544 bytes
linux-user/elfload.c vma_dump_size stack usage is 8224 bytes
net/net.c qemu_del_net_client stack usage is 8256 bytes
net/net.c qmp_set_link stack usage is 8256 bytes
16384+:
block/vmdk.c vmdk_create stack usage is 29376 bytes
block/vmdk.c vmdk_write_cid stack usage is 20544 bytes
hw/arm/nseries.c n8x0_init stack usage is 65728 bytes
hw/char/virtio-serial-bus.c control_out stack usage is 49312 bytes
hw/char/virtio-serial-bus.c discard_vq_data stack usage is 49216 bytes
hw/char/virtio-serial-bus.c send_control_msg stack usage is 49232 bytes
hw/char/virtio-serial-bus.c write_to_port stack usage is 49264 bytes
hw/display/vmware_vga.c vmsvga_fifo_run stack usage is 20688 bytes
hw/dma/xilinx_axidma.c stream_process_mem2s stack usage is 16480 bytes
hw/net/opencores_eth.c open_eth_start_xmit stack usage is 65600 bytes
hw/net/virtio-net.c virtio_net_flush_tx stack usage is 65664 bytes
hw/net/virtio-net.c virtio_net_handle_ctrl stack usage is 49424 bytes
hw/net/virtio-net.c virtio_net_handle_ctrl stack usage is 49440 bytes
hw/net/virtio-net.c virtio_net_receive stack usage is 65728 bytes
hw/virtio/virtio-balloon.c virtio_balloon_handle_output stack usage is 49376 bytes
hw/virtio/virtio-rng.c chr_read stack usage is 49264 bytes
net/net.c nc_sendv_compat stack usage is 69680 bytes
net/socket.c net_socket_send stack usage is 69712 bytes
net/tap.c net_init_tap stack usage is 16704 bytes
ui/vnc-jobs.c vnc_worker_thread_loop stack usage is 75632 bytes
"might be unbounded":
backends/baum.c baum_write_packet stack usage might be unbounded
backends/baum.c baum_write stack usage might be unbounded
backends/rng-random.c entropy_available stack usage might be unbounded
block/qapi.c dump_qobject stack usage might be unbounded
block/vpc.c vpc_co_write stack usage might be unbounded
exec.c mem_commit stack usage might be unbounded
hw/block/nvme.c nvme_map_prp stack usage might be unbounded
hw/bt/l2cap.c l2cap_cframe_in stack usage might be unbounded
hw/i386/multiboot.c load_multiboot stack usage might be unbounded
hw/intc/xics.c ics_reset stack usage might be unbounded
hw/net/fsl_etsec/rings.c etsec_walk_rx_ring stack usage might be unbounded
hw/net/spapr_llan.c h_send_logical_lan stack usage might be unbounded
hw/ppc/spapr.c spapr_fixup_cpu_dt stack usage might be unbounded
hw/ppc/spapr_pci.c spapr_phb_realize stack usage might be unbounded
hw/ssi/xilinx_spips.c xilinx_spips_flush_txfifo stack usage might be unbounded
hw/usb/dev-hid.c usb_hid_handle_data stack usage might be unbounded
hw/usb/dev-mtp.c usb_mtp_add_str stack usage might be unbounded
hw/usb/dev-mtp.c usb_mtp_handle_data stack usage might be unbounded
hw/usb/dev-wacom.c usb_wacom_handle_data stack usage might be unbounded
hw/usb/hcd-xhci.c xhci_process_commands stack usage might be unbounded
hw/usb/redirect.c usbredir_handle_data stack usage might be unbounded
linux-user/elfload.c load_elf_image stack usage might be unbounded
linux-user/elfload.c load_symbols stack usage might be unbounded
linux-user/syscall.c do_accept4 stack usage might be unbounded
linux-user/syscall.c do_bind stack usage might be unbounded
linux-user/syscall.c do_connect stack usage might be unbounded
linux-user/syscall.c do_getpeername stack usage might be unbounded
linux-user/syscall.c do_getsockname stack usage might be unbounded
linux-user/syscall.c do_recvfrom stack usage might be unbounded
linux-user/syscall.c do_sendrecvmsg_locked stack usage might be unbounded
linux-user/syscall.c do_sendto stack usage might be unbounded
linux-user/syscall.c do_setsockopt stack usage might be unbounded
linux-user/syscall.c do_syscall stack usage might be unbounded
net/tap.c tap_receive_iov stack usage might be unbounded
qemu-char.c tcp_chr_write stack usage might be unbounded
target-arm/helper.c helper_dc_zva stack usage might be unbounded
ui/vnc-enc-hextile-template.h send_hextile_tile_32 stack usage might be unbounded
ui/vnc-enc-hextile-template.h send_hextile_tile_generic_32 stack usage might be unbounded
ui/vnc-enc-tight.c send_palette_rect stack usage might be unbounded
util/iov.c qemu_iovec_clone stack usage might be unbounded
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 11:18 ` Paolo Bonzini
@ 2015-01-14 13:38 ` Kevin Wolf
2015-01-14 13:49 ` Paolo Bonzini
2015-01-14 14:24 ` Markus Armbruster
1 sibling, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2015-01-14 13:38 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
Am 14.01.2015 um 12:18 hat Paolo Bonzini geschrieben:
>
>
> On 14/01/2015 11:20, Kevin Wolf wrote:
> >> > The same problem applies to coroutine stacks, and those cannot be
> >> > throttled down as easily. But I guess if you limit the number of
> >> > threads, the guest gets slowed down and doesn't create as many coroutines.
> > Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
> > coroutine is really a lot, and as I understand it, threads take even
> > more by default.
>
> Yup, 2 MB. Last time I proposed this, I think Markus was strongly in
> the "better safe than sorry" camp. :)
>
> But thread pool workers definitely don't need a big stack.
Right, I think we need to consider what kind of thread it is. For the
moment, I'm talking about the block layer only.
> >> > It would be nice to have a way to measure coroutine stack usage, similar
> >> > to what the kernel does.
> > The information which pages are mapped should be there somewhere...
>
> Yes, there is mincore(2). The complicated part is doing it fast, but
> perhaps it doesn't need to be fast.
Well, what do you want to use it for? I thought it would only be for a
one-time check where we usually end up rather than something that would
be enabled in production, but maybe I misunderstood.
> I tried gathering warning from GCC's -Wstack-usage=1023 option and the
> block layer does not seem to have functions with huge stacks in the I/O
> path.
>
> So, assuming a maximum stack depth of 50 (already pretty generous since
> there shouldn't be any recursive calls) a 100K stack should be pretty
> much okay for coroutines and thread-pool threads.
The potential problem in the block layer is long backing file chains.
Perhaps we need to do something to solve that iteratively instead of
recursively.
> That said there are some offenders in the device models. Other
> QemuThreads, especially VCPU threads, had better stay with a big stack.
Yes, that's not exactly surprising.
Kevin
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 13:38 ` Kevin Wolf
@ 2015-01-14 13:49 ` Paolo Bonzini
2015-01-14 14:07 ` Kevin Wolf
0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2015-01-14 13:49 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
On 14/01/2015 14:38, Kevin Wolf wrote:
> Well, what do you want to use it for? I thought it would only be for a
> one-time check where we usually end up rather than something that would
> be enabled in production, but maybe I misunderstood.
No, you didn't. Though I guess we could limit the checks to the yield
points. If we have BDS recursion, as in the backing file case, yield
points should not be far from the deepest part of the stack.
Another possibility (which cannot be enabled in production) is to fill
the stack with a known 64-bit value, and do a binary search when the
coroutine is destroyed.
>> I tried gathering warning from GCC's -Wstack-usage=1023 option and the
>> block layer does not seem to have functions with huge stacks in the I/O
>> path.
>>
>> So, assuming a maximum stack depth of 50 (already pretty generous since
>> there shouldn't be any recursive calls) a 100K stack should be pretty
>> much okay for coroutines and thread-pool threads.
>
> The potential problem in the block layer is long backing file chains.
> Perhaps we need to do something to solve that iteratively instead of
> recursively.
Basically first read stuff from the current BDS, and then "fill in the
blanks" with a tail call on bs->backing_file? That would be quite a
change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
Paolo
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 13:49 ` Paolo Bonzini
@ 2015-01-14 14:07 ` Kevin Wolf
2015-01-14 14:09 ` Alexander Graf
0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2015-01-14 14:07 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
>
>
> On 14/01/2015 14:38, Kevin Wolf wrote:
> > Well, what do you want to use it for? I thought it would only be for a
> > one-time check where we usually end up rather than something that would
> > be enabled in production, but maybe I misunderstood.
>
> No, you didn't. Though I guess we could limit the checks to the yield
> points. If we have BDS recursion, as in the backing file case, yield
> points should not be far from the deepest part of the stack.
>
> Another possibility (which cannot be enabled in production) is to fill
> the stack with a known 64-bit value, and do a binary search when the
> coroutine is destroyed.
Maybe that's the easiest one, yes.
> >> I tried gathering warning from GCC's -Wstack-usage=1023 option and the
> >> block layer does not seem to have functions with huge stacks in the I/O
> >> path.
> >>
> >> So, assuming a maximum stack depth of 50 (already pretty generous since
> >> there shouldn't be any recursive calls) a 100K stack should be pretty
> >> much okay for coroutines and thread-pool threads.
> >
> > The potential problem in the block layer is long backing file chains.
> > Perhaps we need to do something to solve that iteratively instead of
> > recursively.
>
> Basically first read stuff from the current BDS, and then "fill in the
> blanks" with a tail call on bs->backing_file? That would be quite a
> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
Basically block.c would do something like get_block_status() first and
only then call the read/write functions of the individual drivers. But
yes, that's more a theoretical consideration at this point.
I think with the 50 recursions that you calculated we should be fine in
practice for now. I would however strongly recommend finally implementing
a guard page for coroutine stacks before we make that change.
Anyway, the thread pool workers aren't affected by any of this, so they
would be the obvious first step.
Kevin
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 14:07 ` Kevin Wolf
@ 2015-01-14 14:09 ` Alexander Graf
2015-01-15 10:00 ` Kevin Wolf
0 siblings, 1 reply; 12+ messages in thread
From: Alexander Graf @ 2015-01-14 14:09 UTC (permalink / raw)
To: Kevin Wolf, Paolo Bonzini; +Cc: qemu-devel, Stefan Hajnoczi
On 01/14/15 15:07, Kevin Wolf wrote:
> Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
>>
>> On 14/01/2015 14:38, Kevin Wolf wrote:
>>> Well, what do you want to use it for? I thought it would only be for a
>>> one-time check where we usually end up rather than something that would
>>> be enabled in production, but maybe I misunderstood.
>> No, you didn't. Though I guess we could limit the checks to the yield
>> points. If we have BDS recursion, as in the backing file case, yield
>> points should not be far from the deepest part of the stack.
>>
>> Another possibility (which cannot be enabled in production) is to fill
>> the stack with a known 64-bit value, and do a binary search when the
>> coroutine is destroyed.
> Maybe that's the easiest one, yes.
>
>>>> I tried gathering warning from GCC's -Wstack-usage=1023 option and the
>>>> block layer does not seem to have functions with huge stacks in the I/O
>>>> path.
>>>>
>>>> So, assuming a maximum stack depth of 50 (already pretty generous since
>>>> there shouldn't be any recursive calls) a 100K stack should be pretty
>>>> much okay for coroutines and thread-pool threads.
>>> The potential problem in the block layer is long backing file chains.
>>> Perhaps we need to do something to solve that iteratively instead of
>>> recursively.
>> Basically first read stuff from the current BDS, and then "fill in the
>> blanks" with a tail call on bs->backing_file? That would be quite a
>> change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> Basically block.c would do something like get_block_status() first and
> only then call the read/write functions of the individual drivers. But
> yes, that's more a theoretical consideration at this point.
>
> I think with the 50 recursions that you calculated we should be fine in
> practice for now. I would however strongly recommend finally implementing
> a guard page for coroutine stacks before we make that change.
We could just write mprotect an excessively mapped page as guard page, no?
> Anyway, the thread pool workers aren't affected by any of this, so they
> would be the obvious first step.
Yes, ideally we would have the maximum number of threads be runtime
configurable too. That way you can adjust them to your workload.
Alex
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 11:18 ` Paolo Bonzini
2015-01-14 13:38 ` Kevin Wolf
@ 2015-01-14 14:24 ` Markus Armbruster
1 sibling, 0 replies; 12+ messages in thread
From: Markus Armbruster @ 2015-01-14 14:24 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Kevin Wolf, Alexander Graf, Stefan Hajnoczi, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 14/01/2015 11:20, Kevin Wolf wrote:
>>> > The same problem applies to coroutine stacks, and those cannot be
>>> > throttled down as easily. But I guess if you limit the number of
>>> > threads, the guest gets slowed down and doesn't create as many coroutines.
>> Shouldn't we rather try and decrease the stack sizes a bit? 1 MB per
>> coroutine is really a lot, and as I understand it, threads take even
>> more by default.
>
> Yup, 2 MB. Last time I proposed this, I think Markus was strongly in
> the "better safe than sorry" camp. :)
Yup. Running out of stack is a nasty failure mode.
> But thread pool workers definitely don't need a big stack.
When analysis leads to an upper bound for stack size, by all means use
it.
Absent rigorous analysis backed by testing, I recommend to throw address
space at the problem.
32 bit limitations can force us to throw less generously (and be less
safe) there. Not a good reason to compromise safety on 64 bit machines
as well, though.
[...]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 14:09 ` Alexander Graf
@ 2015-01-15 10:00 ` Kevin Wolf
0 siblings, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2015-01-15 10:00 UTC (permalink / raw)
To: Alexander Graf; +Cc: Paolo Bonzini, qemu-devel, Stefan Hajnoczi
Am 14.01.2015 um 15:09 hat Alexander Graf geschrieben:
> On 01/14/15 15:07, Kevin Wolf wrote:
> >Am 14.01.2015 um 14:49 hat Paolo Bonzini geschrieben:
> >>
> >>On 14/01/2015 14:38, Kevin Wolf wrote:
> >>>Well, what do you want to use it for? I thought it would only be for a
> >>>one-time check where we usually end up rather than something that would
> >>>be enabled in production, but maybe I misunderstood.
> >>No, you didn't. Though I guess we could limit the checks to the yield
> >>points. If we have BDS recursion, as in the backing file case, yield
> >>points should not be far from the deepest part of the stack.
> >>
> >>Another possibility (which cannot be enabled in production) is to fill
> >>the stack with a known 64-bit value, and do a binary search when the
> >>coroutine is destroyed.
> >Maybe that's the easiest one, yes.
> >
> >>>>I tried gathering warning from GCC's -Wstack-usage=1023 option and the
> >>>>block layer does not seem to have functions with huge stacks in the I/O
> >>>>path.
> >>>>
> >>>>So, assuming a maximum stack depth of 50 (already pretty generous since
> >>>>there shouldn't be any recursive calls) a 100K stack should be pretty
> >>>>much okay for coroutines and thread-pool threads.
> >>>The potential problem in the block layer is long backing file chains.
> >>>Perhaps we need to do something to solve that iteratively instead of
> >>>recursively.
> >>Basically first read stuff from the current BDS, and then "fill in the
> >>blanks" with a tail call on bs->backing_file? That would be quite a
> >>change, and we'd need a stopgap measure like Alex's patch in the meanwhile.
> >Basically block.c would do something like get_block_status() first and
> >only then call the read/write functions of the individual drivers. But
> >yes, that's more a theoretical consideration at this point.
> >
> >I think with the 50 recursions that you calculated we should be fine in
> >practice for now. I would however strongly recommend finally implementing
> >a guard page for coroutine stacks before we make that change.
>
> We could just write mprotect an excessively mapped page as guard page, no?
Not just write protect, but PROT_NONE, but otherwise yes, I think that's
how it usually done (or directly with a mmap instead of modifying it
after the fact).
> >Anyway, the thread pool workers aren't affected by any of this, so they
> >would be the obvious first step.
>
> Yes, ideally we would have the maximum number of threads be runtime
> configurable too. That way you can adjust them to your workload.
Should be easy enough to add an option to raw-{posix,win32} for that.
Kevin
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-01-14 0:56 [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts Alexander Graf
2015-01-14 7:37 ` Paolo Bonzini
@ 2015-02-12 15:38 ` Stefan Hajnoczi
2015-02-12 15:59 ` Kevin Wolf
1 sibling, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2015-02-12 15:38 UTC (permalink / raw)
To: Alexander Graf; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]
On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> On hosts with limited virtual address space (32bit pointers), we can very
> easily run out of virtual memory with big thread pools.
>
> Instead, we should limit ourselves to small pools to keep memory footprint
> low on those systems.
>
> This patch fixes random VM stalls like
>
> (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
>
> on 32bit ARM systems for me.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
> thread-pool.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/thread-pool.c b/thread-pool.c
> index e2cac8e..87a3ea9 100644
> --- a/thread-pool.c
> +++ b/thread-pool.c
> @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> qemu_mutex_init(&pool->lock);
> qemu_cond_init(&pool->worker_stopped);
> qemu_sem_init(&pool->sem, 0);
> - pool->max_threads = 64;
> + if (sizeof(pool) == 4) {
> + /* 32bit systems run out of virtual memory quickly */
> + pool->max_threads = 4;
> + } else {
> + pool->max_threads = 64;
> + }
> pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
>
> QLIST_INIT(&pool->head);
This patch has not been applied.
Have there been any changes in this area?
Stefan
[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts
2015-02-12 15:38 ` Stefan Hajnoczi
@ 2015-02-12 15:59 ` Kevin Wolf
0 siblings, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2015-02-12 15:59 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Alexander Graf, Stefan Hajnoczi, qemu-devel
[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]
Am 12.02.2015 um 16:38 hat Stefan Hajnoczi geschrieben:
> On Wed, Jan 14, 2015 at 01:56:56AM +0100, Alexander Graf wrote:
> > On hosts with limited virtual address space (32bit pointers), we can very
> > easily run out of virtual memory with big thread pools.
> >
> > Instead, we should limit ourselves to small pools to keep memory footprint
> > low on those systems.
> >
> > This patch fixes random VM stalls like
> >
> > (process:25114): GLib-ERROR **: gmem.c:103: failed to allocate 1048576 bytes
> >
> > on 32bit ARM systems for me.
> >
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> > ---
> > thread-pool.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/thread-pool.c b/thread-pool.c
> > index e2cac8e..87a3ea9 100644
> > --- a/thread-pool.c
> > +++ b/thread-pool.c
> > @@ -299,7 +299,12 @@ static void thread_pool_init_one(ThreadPool *pool, AioContext *ctx)
> > qemu_mutex_init(&pool->lock);
> > qemu_cond_init(&pool->worker_stopped);
> > qemu_sem_init(&pool->sem, 0);
> > - pool->max_threads = 64;
> > + if (sizeof(pool) == 4) {
> > + /* 32bit systems run out of virtual memory quickly */
> > + pool->max_threads = 4;
> > + } else {
> > + pool->max_threads = 64;
> > + }
> > pool->new_thread_bh = aio_bh_new(ctx, spawn_thread_bh_fn, pool);
> >
> > QLIST_INIT(&pool->head);
>
> This patch has not been applied.
>
> Have there been any changes in this area?
I think the idea was to implement guard pages for coroutine stacks and
then decrease the stack size of both coroutines and worker threads on
32 bit.
Kevin
[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2015-02-12 15:59 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-14 0:56 [Qemu-devel] [PATCH] AIO: Reduce number of threads for 32bit hosts Alexander Graf
2015-01-14 7:37 ` Paolo Bonzini
2015-01-14 10:20 ` Kevin Wolf
2015-01-14 11:18 ` Paolo Bonzini
2015-01-14 13:38 ` Kevin Wolf
2015-01-14 13:49 ` Paolo Bonzini
2015-01-14 14:07 ` Kevin Wolf
2015-01-14 14:09 ` Alexander Graf
2015-01-15 10:00 ` Kevin Wolf
2015-01-14 14:24 ` Markus Armbruster
2015-02-12 15:38 ` Stefan Hajnoczi
2015-02-12 15:59 ` Kevin Wolf
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.