* EXT4 vs LVM performance for VMs
@ 2016-02-11 20:50 Premysl Kouril
  2016-02-12  6:12 ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread

From: Premysl Kouril @ 2016-02-11 20:50 UTC (permalink / raw)
To: linux-fsdevel

Hi All,

We are in the process of setting up a new cloud infrastructure and are deciding whether we should use file-backed or LVM-volume-backed virtual machines. I would like to kindly ask the community to confirm some of our performance-related findings and/or advise whether there is something that can be done about them.

I have heard that the performance difference between LVM volumes and files on a filesystem (when used for VM disks) is only about 1-5%, but this is not what we are seeing.

Regarding our test hypervisor: it is filled with SSD disks, each capable of up to 130 000 write IOPS, and it has plenty of CPU and RAM. We test performance by running fio inside virtual machines (KVM based) hosted on this hypervisor. In order to achieve comparable and consistent benchmark results, the virtual machines are single-core VMs and CPU hyperthreading is turned off on the hypervisor. Furthermore, CPU cores are dedicated to the virtual machines using CPU pinning (so a particular VM runs only on a particular CPU core).

Here is the fio command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=3G --numjobs=1 --readwrite=randwrite

Kernel version: 3.16.0-60-generic

We mount the filesystem with:

mount -o noatime,data=writeback,barrier=0 /dev/md127 /var/lib/nova/instances/

We also disable journaling for testing purposes.

Here are the results when we use an LVM volume as the virtual machine disk versus when we use a file (qcow2 or raw) stored on an EXT4 filesystem for the VM disk.
Test 1: Maximum sustainable write IOPS achieved on a single VM:

  LVM:  ~16 000 IOPS
  EXT4: ~ 4 000 IOPS

Test 2: Maximum sustainable write IOPS achieved on the hypervisor by running multiple test VMs:

  LVM:  ~40 000 IOPS (at which point MDRAID5 hit 100% CPU utilization)
  EXT4: ~20 000 IOPS

So basically LVM seems to perform much better. Note that in the second test the RAID started to be the bottleneck, so it is possible that the LVM layer would be capable of even more on a faster RAID.

In Test 1:

- on LVM we hit 100% utilization of the qemu VM process with: usr:50%, sys:50%, wait:0%
- on EXT4 we hit 100% utilization of the qemu VM process with: usr:30%, sys:30%, wait:30%

So the performance of EXT4 is significantly lower, and when using EXT4 we saw significant wait time. I tried to look at it a bit (using a custom SystemTap script) and here is my observation. When checking what is going on on the CPU which is executing the KVM qemu process of the test VM, it seems to be executing 2 main threads (these 2 threads are responsible for most of the time spent on CPU) and about 60-70 other threads which I assume are filesystem workers. Of the 2 main threads, one seems OK and does not appear to be waiting for a lock or anything - most of the time I see this thread leaving the CPU it is due to a normal scheduler interrupt.
The other main thread actually spends a lot of time waiting for a lock; when this thread leaves the CPU it often does so here:

TID: 9838 waited 4916317 ns here:
 0xffffffffc1e4f12b : 0xffffffffc1e4f12b [stap_b9c4a8366b974feec4893d4b5949417_17490+0x912b/0x0]
 0xffffffffc1e5061b : 0xffffffffc1e5061b [stap_b9c4a8366b974feec4893d4b5949417_17490+0xa61b/0x0]
 0xffffffffc1e51e7a : 0xffffffffc1e51e7a [stap_b9c4a8366b974feec4893d4b5949417_17490+0xbe7a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_b9c4a8366b974feec4893d4b5949417_17490+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

When looking at the 60 worker threads, they often leave the CPU in the following places and wait there for a fairly large number of CPU cycles.

First place:

TID: 57936 waited 2092552 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Second place:

TID: 57937 waited 1542013 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffff81250ca9 : ext4_file_write_iter+0x79/0x3a0 [kernel]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

So basically we attribute the lower EXT4 performance to these points where things need to be synchronized using locks, but this is just what we see at a high level, so I would be curious whether the dev community thinks this might be the cause.

All in all, I'd like to ask the following questions:

1) Are the benchmark results what you would expect?
2) Can the lower performance be attributed to the locking?
3) Is there something we could do to improve the performance of the filesystem?
4) Are there any plans for development in this area?

Regards,
Premysl Kouril

^ permalink raw reply	[flat|nested] 14+ messages in thread
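For reference, the journal-less test setup described above would typically be prepared along these lines. This is only a sketch: the post does not say exactly how journaling was disabled, so the tune2fs step is an assumption; the device and mount point are taken from the mount command in the post.

```shell
# Assumed sequence for the "journaling disabled" test configuration.
# tune2fs can only remove the journal while the filesystem is unmounted.
umount /var/lib/nova/instances/
tune2fs -O ^has_journal /dev/md127     # drop the ext4 journal
e2fsck -f /dev/md127                   # recommended after changing features
mount -o noatime,data=writeback,barrier=0 /dev/md127 /var/lib/nova/instances/
```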
* Re: EXT4 vs LVM performance for VMs
  2016-02-11 20:50 EXT4 vs LVM performance for VMs Premysl Kouril
@ 2016-02-12  6:12 ` Andi Kleen
  2016-02-12  9:09   ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread

From: Andi Kleen @ 2016-02-12 6:12 UTC (permalink / raw)
To: Premysl Kouril; +Cc: linux-fsdevel

Premysl Kouril <premysl.kouril@gmail.com> writes:
>
> So basically we attribute the lower EXT4 performance to these points
> where things need to be synchronized using locks but this is just what
> we see at high level so I would be curious if dev community thinks
> this might be the cause.

Except for the last one, the backtraces you're showing are for futex locks, which are not used by the kernel but by some user process. So the locking problem is somewhere in the user-space setup (perhaps in qemu). This would indicate your ext4 setup is not the same as your LVM setup.

The last backtrace is the inode mutex, which is needed for POSIX semantics to get atomic writes.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: EXT4 vs LVM performance for VMs
  2016-02-12  6:12 ` Andi Kleen
@ 2016-02-12  9:09 ` Premysl Kouril
  2016-02-12 13:38   ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread

From: Premysl Kouril @ 2016-02-12 9:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-fsdevel

> Except for the last one, the backtraces you're showing are for futex locks,
> which are not used by the kernel but by some user process. So the locking
> problem is somewhere in the user-space setup (perhaps in qemu).
> This would indicate your ext4 setup is not the same as LVM.

The qemu/kvm setup is exactly the same (same test box, same command line arguments, except for the argument for the virtual machine disk, which references either the LVM volume or the EXT4-based file).

> The last backtrace is the inode mutex, which is needed for POSIX semantics
> to get atomic writes.

Hmm, given that the user-space thread is waiting on a futex in the

0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]

backtrace, isn't it really that the user-space thread is blocked as a result of lock contention on the kernel inode mutex (i.e. 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]) and is basically waiting for the filesystem to be done with the write?
* Re: EXT4 vs LVM performance for VMs
  2016-02-12  9:09 ` Premysl Kouril
@ 2016-02-12 13:38 ` Theodore Ts'o
  2016-02-12 14:13   ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread

From: Theodore Ts'o @ 2016-02-12 13:38 UTC (permalink / raw)
To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 10:09:39AM +0100, Premysl Kouril wrote:
> > Except for the last one, the backtraces you're showing are for futex locks,
> > which are not used by the kernel but by some user process. So the locking
> > problem is somewhere in the user-space setup (perhaps in qemu).
> > This would indicate your ext4 setup is not the same as LVM.
>
> The qemu/kvm setup is exactly the same (same test box, same command
> line arguments, except for the argument for the virtual machine disk,
> which references either the LVM volume or the EXT4-based file)

You mentioned using qcow; if you're using qcow, then the userspace qemu/kvm process will need to do its own locking to manage its own space-management metadata.

In general, if you need to do allocation management, either because you are writing to a sparse file and ext4 has to do block allocation, or because you are writing to a fallocated file and ext4 has to keep track of whether a block has been written to that location before (and if not, do a journalled transaction to clear the unwritten bit), there will be extra work that has to be done at the qcow or ext4 layer that doesn't have to be done at the LVM layer. Of course, this work provides extra services (such as space management and/or not revealing previously written block contents from another customer to your current customer's VM, which might make your local data protection authorities cranky).

So the devil is very much in the details of how you set up the hypervisor, and "except for the argument for the virtual machine disk" leaves an awful lot unspecified. Whether you preallocated and pre-zeroed the file can make a difference, whether you are using buffered or direct I/O based on the cache parameter makes a difference, etc., etc.

Cheers,

					- Ted
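The two preallocation strategies alluded to above can be sketched as follows. The file names are illustrative; the 3G size matches the fio file in the original post.

```shell
# Preallocate with fallocate: fast, but leaves the extents marked
# "unwritten", so ext4 must journal the clearing of the unwritten bit
# the first time each block is written.
fallocate -l 3G /var/lib/nova/instances/test-prealloc.img

# Pre-zero with dd: slow up front, but every later write is a plain
# overwrite of an already-initialized block, with no extent-state change.
dd if=/dev/zero of=/var/lib/nova/instances/test-zeroed.img bs=1M count=3072 conv=fsync
```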
* Re: EXT4 vs LVM performance for VMs
  2016-02-12 13:38 ` Theodore Ts'o
@ 2016-02-12 14:13 ` Premysl Kouril
  2016-02-12 16:53   ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread

From: Premysl Kouril @ 2016-02-12 14:13 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

> You mentioned using qcow; if you're using qcow, then the userspace
> qemu/kvm process will need to do its own locking to manage its own
> space-management metadata.

I tried raw files too; there is not much difference in performance.

> In general, if you need to do allocation management, either because
> you are writing to a sparse file and ext4 has to do block allocation,
> or because you are writing to a fallocated file and ext4 has to keep track
> of whether a block has been written to that location before (and if
> not, do a journalled transaction to clear the unwritten bit), there
> will be extra work that has to be done at the qcow or ext4 layer that
> doesn't have to be done at the LVM layer.

The performance results which I posted in my original post are from after all the block allocation is done. In other words: on the test machine we do a first fio run which writes and allocates the test file (during this first run the fio performance is actually much worse than what I reported; single-VM performance is about 700 IOPS), and then we do a second fio run and take the benchmark numbers from this second run.

> So the devil is very much in the details of how you set up the
> hypervisor, and "except for the argument for the virtual machine disk"
> leaves an awful lot unspecified.
> Whether you preallocated and
> pre-zeroed the file can make a difference, whether you are using
> buffered or direct I/O based on the cache parameter makes a
> difference, etc., etc.

Here is the qemu command line when using a raw file on EXT4:

/usr/bin/qemu-system-x86_64 -name instance-00000327 -S -machine pc-i440fx-utopic,accel=kvm,usb=off -cpu Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-ram,size=2048M,id=ram-node0,host-nodes=1,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 9426ece6-6c1f-403d-a7bf-fa8fb975b321 -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=9426ece6-6c1f-403d-a7bf-fa8fb975b321 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000327.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=44 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:ea:40:eb,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:16 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

Here is the command line when using LVM:

/usr/bin/qemu-system-x86_64 -name instance-0000033b -S -machine pc-i440fx-utopic,accel=kvm,usb=off -cpu Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object memory-backend-ram,size=2048M,id=ram-node0,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid d84d39c7-beba-4895-83ac-d2718fdee3f3 -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=d84d39c7-beba-4895-83ac-d2718fdee3f3 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0000033b.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:4c:1a:66,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/d84d39c7-beba-4895-83ac-d2718fdee3f3/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

The virtual disk part for the raw file:

-drive file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

The virtual disk part for LVM:

-drive file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

Cheers,
Prema
* Re: EXT4 vs LVM performance for VMs
  2016-02-12 14:13 ` Premysl Kouril
@ 2016-02-12 16:53 ` Theodore Ts'o
  2016-02-12 17:38   ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread

From: Theodore Ts'o @ 2016-02-12 16:53 UTC (permalink / raw)
To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 03:13:26PM +0100, Premysl Kouril wrote:
> The performance results which I posted in my original post are from after
> all the block allocation is done. In other words: on the test
> machine we do a first fio run which writes and allocates the test
> file (during this first run the fio performance is actually much worse
> than what I reported; single-VM performance is about 700 IOPS), and
> then we do a second fio run and take the benchmark numbers from this
> second run.

Since you allocated the file using a random write workload, the file was probably not very contiguous, whereas when you allocated the LVM volume, it was probably allocated contiguously. You can use the filefrag tool to see how fragmented the file is.

All of this being said, what are you trying to do? If you are happy using LVM, feel free to use it. If there are specific features that you want out of the file system, it's best that you explicitly identify what you want, so we can minimize the cost of the features you need.

Cheers,

					- Ted
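The filefrag check suggested above might look like this; the path is the raw backing file taken from the qemu command line earlier in the thread, so it is illustrative rather than something reproducible elsewhere.

```shell
# -v prints one line per extent; a file allocated by random 4k writes
# will typically show many short extents, i.e. heavy fragmentation.
filefrag -v /var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk
```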
* Re: EXT4 vs LVM performance for VMs
  2016-02-12 16:53 ` Theodore Ts'o
@ 2016-02-12 17:38 ` Premysl Kouril
  2016-02-13  2:15   ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread

From: Premysl Kouril @ 2016-02-12 17:38 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

> All of this being said, what are you trying to do? If you are happy
> using LVM, feel free to use it. If there are specific features that
> you want out of the file system, it's best that you explicitly
> identify what you want, so we can minimize the cost of the
> features you need.

We are trying to decide whether to use a filesystem or LVM for VM storage. It's not that we are happy with LVM: while it performs better, there are limitations on the LVM side, especially when it comes to manageability (for example, certain features in OpenStack only work if the VM is file-based).

So, in short, if we could make the filesystem perform better we would rather use the filesystem than LVM (and we don't really have any special requirements in terms of filesystem features).

And in order for us to make a good decision, I wanted to ask the community whether our observations and the resulting numbers make sense.

Cheers,
Prema
* Re: EXT4 vs LVM performance for VMs
  2016-02-12 17:38 ` Premysl Kouril
@ 2016-02-13  2:15 ` Dave Chinner
  2016-02-13 21:56   ` Sanidhya Kashyap
  2016-02-15 18:56   ` Premysl Kouril
  0 siblings, 2 replies; 14+ messages in thread

From: Dave Chinner @ 2016-02-13 2:15 UTC (permalink / raw)
To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> > All of this being said, what are you trying to do? If you are happy
> > using LVM, feel free to use it. If there are specific features that
> > you want out of the file system, it's best that you explicitly
> > identify what you want, so we can minimize the cost of the
> > features you need.
>
> We are trying to decide whether to use a filesystem or LVM for VM
> storage. It's not that we are happy with LVM: while it performs
> better, there are limitations on the LVM side, especially when it comes to
> manageability (for example, certain features in OpenStack only work
> if the VM is file-based).
>
> So, in short, if we could make the filesystem perform better we would
> rather use the filesystem than LVM (and we don't really have any special
> requirements in terms of filesystem features).
>
> And in order for us to make a good decision, I wanted to ask the community
> whether our observations and the resulting numbers make sense.

For ext4, this is what you are going to get.

How about you try XFS? After all, concurrent direct IO writes are something it is rather good at.

i.e. use XFS in both your host and guest. Use raw image files on the host, and to make things roughly even with LVM you'll want to preallocate them. If you don't want to preallocate them (i.e. sparse image files), set them up with an extent size hint of at least 1MB so that it limits fragmentation of the image file. Then configure qemu to use cache=none for its IO to the image file.

On the first write pass to the image file (in either case), you should see ~70-80% of the native underlying device performance because there is some overhead in either allocation (sparse image file) or unwritten extent conversion (preallocated image file). This, of course, assumes you are not CPU limited in the QEMU process by the additional CPU overhead of file block mapping in the host filesystem vs raw block device IO.

On the second write pass you should see 98-99% of the native underlying device performance (again with the assumption that the CPU overhead of the host filesystem isn't a limiting factor).

As an example, I have a block device that can sustain just under 36k random 4k write IOPS on my host. I have an XFS filesystem (default configs) on that 400GB block device. I created a sparse 500TB image file using:

# xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img

And push it into a 16p/16GB RAM guest via:

-drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw

and in the guest run mkfs.xfs with defaults and mount it with defaults. Then I ran your fio test on that 5 times in a row:

write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec

The first run was 26k IOPS, the rest were at 35k IOPS as they overwrite the same blocks in the image file. IOWs, first pass at 75% of device capability, the rest at >98% of the host-measured device capability. All tests reported the full IO depth being used in the guest:

IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%

The guest OS measured about 30% CPU usage for a single fio run at 35k IOPS:

real	0m22.648s
user	0m1.678s
sys	0m8.175s

However, the QEMU process on the host required 4 entire CPUs to sustain this IO load, roughly 50/50 user/system time. IOWs, a large amount of the CPU overhead on such workloads is on the host side in QEMU, not the guest.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: EXT4 vs LVM performance for VMs
  2016-02-13  2:15 ` Dave Chinner
@ 2016-02-13 21:56 ` Sanidhya Kashyap
  2016-02-13 23:40   ` Jaegeuk Kim
  2016-02-14  0:01   ` Dave Chinner
  1 sibling, 2 replies; 14+ messages in thread

From: Sanidhya Kashyap @ 2016-02-13 21:56 UTC (permalink / raw)
To: Dave Chinner
Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel, changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

We did a quite extensive performance evaluation of file systems, including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core scalability, using micro-benchmarks and application benchmarks.

Your workload, i.e., multiple tasks concurrently overwriting a single file whose file system blocks were previously written, is quite similar to one of our benchmarks.

Based on our analysis, none of the file systems supports concurrent updates of a file, even when each task accesses a different region of the file. That is because all of the file systems hold a lock on the entire file. The only exception is the concurrent direct I/O path of XFS.

I think local file systems need to support range-based locking, which is common in parallel file systems, to improve the concurrency of I/O operations, specifically write operations.

If you can split a single file image into multiple files, you can increase the concurrency of write operations a little bit.

For more details, please take a look at our paper draft:
https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf

Though our paper is in review, I think it is okay to share since the review process is single-blind. You can find our analysis of overwrite operations in Section 5.1.2. The scalability behavior of current file systems is summarized in Section 7.

On Fri, Feb 12, 2016 at 9:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
>> We are trying to decide whether to use filesystem or LVM for VM
>> storage. [...]
>
> For ext4, this is what you are going to get.
>
> How about you try XFS? After all, concurrent direct IO writes is
> something it is rather good at.
> [...]
* Re: EXT4 vs LVM performance for VMs
  2016-02-13 21:56 ` Sanidhya Kashyap
@ 2016-02-13 23:40 ` Jaegeuk Kim
  0 siblings, 0 replies; 14+ messages in thread

From: Jaegeuk Kim @ 2016-02-13 23:40 UTC (permalink / raw)
To: Sanidhya Kashyap
Cc: Dave Chinner, Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel, changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

Hi Sanidhya,

It's a very interesting paper to me. Thank you for sharing it.

At a glance, I have a question about F2FS, where the paper concludes that F2FS serializes every write. I don't agree with that, since cp_rwsem, a rw_semaphore, is used to quiesce all of the fs operations only while performing a checkpoint. Other than that case, every operation, including writes, just grabs the read semaphore, so there should be no serialization. It seems there is no sync/fsync contention in the workloads.

Thanks,

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did quite extensive performance evaluation on file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability using micro-benchmarks and application benchmarks.
>
> Based on our analysis, none of the file systems supports concurrent
> update of a file even when each task accesses different region of
> a file. That is because all file systems hold a lock for an entire
> file. Only one exception is the concurrent direct I/O of XFS.
> [...]
All tests reported the full io depth was being used in > > the guest: > > > > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > > > > The guest OS measured about 30% CPU usage for a single fio run at > > 35k IOPS: > > > > real 0m22.648s > > user 0m1.678s > > sys 0m8.175s > > > > However, the QEMU process on the host required 4 entire CPUs to > > sustain this IO load, roughly 50/50 user/system time. IOWs, a large > > amount of the CPU overhead on such workloads is on the host side in > > QEMU, not the guest. > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
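[Editorial note] Jaegeuk's point above — normal F2FS operations take cp_rwsem shared and run concurrently, while only a checkpoint takes it exclusively — can be illustrated with a small readers-writer semaphore sketch. This is illustrative Python, not F2FS code; the class and names are invented for the example:

```python
import threading

class RwSem:
    """Minimal readers-writer semaphore: many shared holders at once,
    one exclusive holder (a user-space analogue of cp_rwsem)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def down_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

cp_rwsem = RwSem()
done = []

def fs_write(i):
    # Normal fs operations grab the semaphore shared, so they run
    # concurrently and are not serialized against each other.
    cp_rwsem.down_read()
    done.append(i)
    cp_rwsem.up_read()

def checkpoint():
    # A checkpoint takes it exclusive, briefly quiescing all operations.
    cp_rwsem.down_write()
    done.append("checkpoint")
    cp_rwsem.up_write()

threads = [threading.Thread(target=fs_write, args=(i,)) for i in range(8)]
threads.append(threading.Thread(target=checkpoint))
for t in threads: t.start()
for t in threads: t.join()
print(len(done))  # 9: eight concurrent writes plus one checkpoint
```

So writes only serialize against the checkpoint, not against each other — which is the behaviour Jaegeuk describes.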
* Re: EXT4 vs LVM performance for VMs 2016-02-13 21:56 ` Sanidhya Kashyap 2016-02-13 23:40 ` Jaegeuk Kim @ 2016-02-14 0:01 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2016-02-14 0:01 UTC (permalink / raw) To: Sanidhya Kashyap Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel, changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did quite extensive performance evaluation on file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability using micro-benchmarks and application benchmarks.
>
> Your workload, i.e., multiple tasks are concurrently overwriting a
> single file, whose file system blocks are previously written, is quite
> similar to one of our benchmark.
>
> Based on our analysis, none of the file systems supports concurrent
> update of a file even when each task accesses different region of
> a file. That is because all file systems hold a lock for an entire
> file. Only one exception is the concurrent direct I/O of XFS.
>
> I think that local file systems need to support the range-based
> locking, which is common in parallel file systems, to improve
> concurrency level of I/O operations, specifically write operations.

Yes, we've spent a fair bit of time talking about that (pretty sure it was a topic of discussion at last year's LSFMM developer conference), but it really isn't a simple thing to add to the VFS or most filesystems.

> If you can split a single file image into multiple files, you can
> increase the concurrency level of write operations a little bit.

At the cost of increased storage stack complexity. Most people don't need extreme performance in their VMs, so a single file is generally adequate on XFS.
> For more details, please take a look at our paper draft:
> https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
>
> Though our paper is in review, I think it is okay to share since
> the review process is single-blinded. You can find our analysis on
> overwrite operations at Section 5.1.2. Scalability behavior of current
> file systems are summarized at Section 7.

It's a nice summary of the issues, but there are no surprises in the paper; it's all things we already know about and, in some cases, are already looking at solutions for (e.g. per-node/per-cpu lists to address inode_sb_list_lock contention, the potential for converting i_mutex to an rwsem to allow shared read-only access to directories, etc).

The only thing that surprised me is how badly rwsems degrade when contended on large machines. I've done local benchmarks on 16p machines with single-file direct IO; pushed to being CPU bound, I've measured over 2 million single-sector random read IOPS, 1.5 million random overwrite IOPS, and ~800k random write w/ allocate IOPS. IOWs, the IO scalability is there when the lock doesn't degrade (which really is a core OS issue, not so much a fs issue).

A couple of things I noticed in the summary:

"High locality can cause performance collapse"

You imply filesystems try to maintain high locality to improve cache hit rates. Filesystems try to maintain locality in disk allocation to minimise seek time for physical IO on related structures, to maintain good performance when /cache misses occur/. IOWs, the scalability of the in-memory caches is completely unrelated to the "high locality" optimisations that filesystems make...

"because XFS holds a per-device lock instead of a per-file lock in an O_DIRECT mode"

That's a new one - I've never heard anyone say that about XFS (and I've heard a lot of wacky things about XFS!).
It's much simpler than that - we don't use the i_mutex in O_DIRECT mode, and instead use shared read locking on the per-inode IO lock for all IO operations.

"Overwriting is as expensive as appending"

You shouldn't make generalisations that don't apply generally to the filesystems you tested. :P

FWIW, log->l_icloglock contention in XFS implies the application has an excessive fsync problem - that's the only way that lock can see any sort of significant concurrent access. It's probably just the case that the old-school algorithm the code uses to wait for journal IO completion was never expected to scale to operations on storage that can sustain millions of IOPS. I'll add it to the list of known journalling scalability bottlenecks in XFS - there are a lot more issues than your testing has told you about.... :/

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
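[Editorial note] The range-based locking Sanidhya proposes above (and Dave notes is hard to retrofit into the VFS) can be sketched in user space as an interval-granular lock manager: writers to disjoint byte ranges proceed concurrently, and only overlapping ranges serialize. This is purely illustrative — it is not how any kernel filesystem implements its inode locks:

```python
import threading

class RangeLock:
    """Byte-range lock manager: overlapping ranges serialize,
    disjoint ranges run concurrently (illustrative sketch only)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # list of (start, end) ranges currently locked

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def acquire(self, start, end):
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()

lock = RangeLock()
progress = []

def writer(name, start, end):
    lock.acquire(start, end)
    progress.append(name)
    lock.release(start, end)

# Writers A and B touch disjoint regions of the "file" and never block
# each other; writer C overlaps both, so it serializes against them.
a = threading.Thread(target=writer, args=("A", 0, 4096))
b = threading.Thread(target=writer, args=("B", 4096, 8192))
c = threading.Thread(target=writer, args=("C", 0, 8192))
for t in (a, b, c): t.start()
for t in (a, b, c): t.join()
print(sorted(progress))  # ['A', 'B', 'C']
```

Under a whole-file lock, A and B would serialize too — which is exactly the concurrency the thread is discussing.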
* Re: EXT4 vs LVM performance for VMs 2016-02-13 2:15 ` Dave Chinner 2016-02-13 21:56 ` Sanidhya Kashyap @ 2016-02-15 18:56 ` Premysl Kouril 2016-02-15 19:11 ` Liu Bo 2016-02-15 23:10 ` Dave Chinner 1 sibling, 2 replies; 14+ messages in thread From: Premysl Kouril @ 2016-02-15 18:56 UTC (permalink / raw) To: Dave Chinner; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

Hello Dave,

thanks for your suggestion. I've just recreated our tests with XFS and preallocated raw files, and the results are almost the same as with EXT4. I again checked with my SystemTap script, and threads are mostly waiting for locks in the following places:

The main KVM thread:

TID: 4532 waited 5135787 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
 0xffffffffc18a964b : 0xffffffffc18a964b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
 0xffffffffc18ab07a : 0xffffffffc18ab07a [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Worker threads (KVM worker or kernel worker):

TID: 12139 waited 7939986 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

TID: 12139 waited 11219902 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Looking at this, does it suggest that the bottleneck is locking in the VFS layer? Or does my setup actually do direct I/O at the host level?
You and Sanidhya mentioned that XFS is good at concurrent DirectIO as it doesn't hold lock on file, but I do see this in the trace: 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel] 0xffffffffc073c531 : 0xffffffffc073c531 [xfs] 0xffffffffc07a788a : 0xffffffffc07a788a [xfs] 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs] So either KVM is not doing directIO or there is some lock xfs must hold to do the write, right? Regards, Premysl Kouril On Sat, Feb 13, 2016 at 3:15 AM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote: >> > All of this being said, what are you trying to do? If you are happy >> > using LVM, feel free to use it. If there are specific features that >> > you want out of the file system, it's best that you explicitly >> > identify what you want, and so we can minimize the cost of the >> > features of what you want. >> >> >> We are trying to decide whether to use filesystem or LVM for VM >> storage. It's not that we are happy with LVM - while it performs >> better there are limitations on LVM side especially when it comes to >> manageability (for example certain features in OpenStack do only fork >> if VM is file-based). >> >> So, in short, if we would make filesystem to perform better we would >> rather use filesystem than LVM, (and we don't really have any special >> requirements in terms of filesystem features). >> >> And in order for us to make a good decision I wanted to ask community, >> if our observations and resultant numbers make sense. > > For ext4, this is what you are going to get. > > How about you try XFS? After all, concurrent direct IO writes is > something it is rather good at. > > i.e. use XFS in both your host and guest. Use raw image files on the > host, and to make things roughly even with LVM you'll want to > preallocate them. If you don't want to preallocate them (i.e. 
sparse > image files) set them up with an extent size hint of at least 1MB so > that it limits fragmentation of the image file. Then configure qemu > to use cache=none for it's IO to the image file. > > On the first write pass to the image file (in either case), you > should see ~70-80% of the native underlying device performance > because there is some overhead in either allocation (sparse image > file) or unwritten extent conversion (preallocated image file). > This, of course, asssumes you are not CPU limited in the QEMU > process by the addition CPU overhead of file block mapping in the > host filesystem vs raw block device IO. > > On the second write pass you should see 98-99% of the native > underlying device performance (again with the assumption that CPU > overhead of the host filesystem isn't a limiting factor). > > As an example, I have a block device that can sustain just under 36k > random 4k write IOPS on my host. I have an XFS filesystem (default > configs) on that 400GB block device. I created a sparse 500TB image > file using: > > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img > > And push it into a 16p/16GB RAM guest via: > > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw > > and in the guest run mkfs.xfs with defaults and mount it with > defaults. Then I ran your fio test on that 5 times in a row: > > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec > > The first run was 26k IOPS, the rest were at 35k IOPS as they > overwrite the same blocks in the image file. IOWs, first pass at 75% > of device capability, the rest at > 98% of the host measured device > capability. 
All tests reported the full io depth was being used in > the guest: > > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > > The guest OS measured about 30% CPU usage for a single fio run at > 35k IOPS: > > real 0m22.648s > user 0m1.678s > sys 0m8.175s > > However, the QEMU process on the host required 4 entire CPUs to > sustain this IO load, roughly 50/50 user/system time. IOWs, a large > amount of the CPU overhead on such workloads is on the host side in > QEMU, not the guest. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
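[Editorial note] The host-side setup Dave recommends in the quoted advice — a fully preallocated raw image, or a sparse one plus an extent size hint — can be sketched with standard syscalls. The path and size below are made up for the demo; the extent size hint itself is XFS-specific and still needs `xfs_io -c "extsize 1m"`, which has no portable syscall equivalent:

```python
import os
import tempfile

size = 64 * 1024 * 1024  # 64 MiB demo image; real VM images are far larger
tmpdir = tempfile.mkdtemp()

# Preallocated raw image: blocks are reserved up front, so guest writes
# only pay unwritten-extent conversion, not block allocation.
prealloc = os.path.join(tmpdir, "vm-prealloc.img")
fd = os.open(prealloc, os.O_CREAT | os.O_WRONLY, 0o644)
os.posix_fallocate(fd, 0, size)
os.close(fd)

# Sparse raw image: just set the logical length; on XFS you would also
# set an extent size hint (xfs_io -c "extsize 1m") to limit fragmentation.
sparse = os.path.join(tmpdir, "vm-sparse.img")
fd = os.open(sparse, os.O_CREAT | os.O_WRONLY, 0o644)
os.ftruncate(fd, size)
os.close(fd)

# Both files report the same logical size, but only the preallocated one
# has blocks actually reserved on disk.
print(os.stat(prealloc).st_size == os.stat(sparse).st_size)      # True
print(os.stat(prealloc).st_blocks >= os.stat(sparse).st_blocks)  # True
```

qemu would then be pointed at the image with `-drive file=...,if=virtio,cache=none,format=raw`, as in the quoted example, so host page-cache buffering does not mask the direct I/O path being measured.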
* Re: EXT4 vs LVM performance for VMs 2016-02-15 18:56 ` Premysl Kouril @ 2016-02-15 19:11 ` Liu Bo 2016-02-15 23:10 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Liu Bo @ 2016-02-15 19:11 UTC (permalink / raw) To: Premysl Kouril; +Cc: Dave Chinner, Theodore Ts'o, Andi Kleen, linux-fsdevel On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote: > Hello Dave, > > thanks for your suggestion. I've just recreated our tests with the XFS > and preallocated raw files and the results seem almost same as with > the EXT4. I again checked stuff with my Systemtap script and threads > are waiting mostly waiting for locks in following placeS: > > The main KVM thread: > > > TID: 4532 waited 5135787 ns here: > 0xffffffffc18a815b : 0xffffffffc18a815b > [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0] > 0xffffffffc18a964b : 0xffffffffc18a964b > [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0] > 0xffffffffc18ab07a : 0xffffffffc18ab07a > [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0] > 0xffffffffc189f014 : 0xffffffffc189f014 > [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0] > 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel] > 0xffffffff8176d839 : schedule+0x29/0x70 [kernel] > 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm] > 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel] > 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm] > 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm] > 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel] > 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel] > 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel] > 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm] > 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel] > 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel] > > > > Worker threads (KVM worker or kernel worker): > > > TID: 12139 waited 7939986 here: > 0xffffffffc1e4f15b : 0xffffffffc1e4f15b > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0] > 0xffffffffc1e5065b : 
0xffffffffc1e5065b > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0] > 0xffffffffc1e5209a : 0xffffffffc1e5209a > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0] > 0xffffffffc1e46014 : 0xffffffffc1e46014 > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0] > 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel] > 0xffffffff8176d839 : schedule+0x29/0x70 [kernel] > 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel] > 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel] > 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel] > 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel] > 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel] > > > > TID: 12139 waited 11219902 here: > 0xffffffffc1e4f15b : 0xffffffffc1e4f15b > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0] > 0xffffffffc1e5065b : 0xffffffffc1e5065b > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0] > 0xffffffffc1e5209a : 0xffffffffc1e5209a > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0] > 0xffffffffc1e46014 : 0xffffffffc1e46014 > [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0] > 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel] > 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel] > 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel] > 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel] > 0xffffffffc073c531 : 0xffffffffc073c531 [xfs] > 0xffffffffc07a788a : 0xffffffffc07a788a [xfs] > 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs] > 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel] > 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel] > 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel] > 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel] > > > > > Looking at this, does it suggest that the bottleneck is locking on the > VFS layer? Or does my setup actually do DirectIO on the host level? 
> You and Sanidhya mentioned that XFS is good at concurrent DirectIO as > it doesn't hold lock on file, but I do see this in the trace: > > 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel] > 0xffffffffc073c531 : 0xffffffffc073c531 [xfs] > 0xffffffffc07a788a : 0xffffffffc07a788a [xfs] > 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs] > > So either KVM is not doing directIO or there is some lock xfs must > hold to do the write, right? Is this gathered when qemu is binded to single CPU? fio takes iodepth=64, but blk-mq uses per-cpu or per-node queue. Not sure if blk-mq is available on 3.16.0. Thanks, -liubo > > > Regards, > Premysl Kouril > > > > > > > > On Sat, Feb 13, 2016 at 3:15 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote: > >> > All of this being said, what are you trying to do? If you are happy > >> > using LVM, feel free to use it. If there are specific features that > >> > you want out of the file system, it's best that you explicitly > >> > identify what you want, and so we can minimize the cost of the > >> > features of what you want. > >> > >> > >> We are trying to decide whether to use filesystem or LVM for VM > >> storage. It's not that we are happy with LVM - while it performs > >> better there are limitations on LVM side especially when it comes to > >> manageability (for example certain features in OpenStack do only fork > >> if VM is file-based). > >> > >> So, in short, if we would make filesystem to perform better we would > >> rather use filesystem than LVM, (and we don't really have any special > >> requirements in terms of filesystem features). > >> > >> And in order for us to make a good decision I wanted to ask community, > >> if our observations and resultant numbers make sense. > > > > For ext4, this is what you are going to get. > > > > How about you try XFS? After all, concurrent direct IO writes is > > something it is rather good at. > > > > i.e. 
use XFS in both your host and guest. Use raw image files on the > > host, and to make things roughly even with LVM you'll want to > > preallocate them. If you don't want to preallocate them (i.e. sparse > > image files) set them up with an extent size hint of at least 1MB so > > that it limits fragmentation of the image file. Then configure qemu > > to use cache=none for it's IO to the image file. > > > > On the first write pass to the image file (in either case), you > > should see ~70-80% of the native underlying device performance > > because there is some overhead in either allocation (sparse image > > file) or unwritten extent conversion (preallocated image file). > > This, of course, asssumes you are not CPU limited in the QEMU > > process by the addition CPU overhead of file block mapping in the > > host filesystem vs raw block device IO. > > > > On the second write pass you should see 98-99% of the native > > underlying device performance (again with the assumption that CPU > > overhead of the host filesystem isn't a limiting factor). > > > > As an example, I have a block device that can sustain just under 36k > > random 4k write IOPS on my host. I have an XFS filesystem (default > > configs) on that 400GB block device. I created a sparse 500TB image > > file using: > > > > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img > > > > And push it into a 16p/16GB RAM guest via: > > > > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw > > > > and in the guest run mkfs.xfs with defaults and mount it with > > defaults. 
Then I ran your fio test on that 5 times in a row: > > > > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec > > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec > > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec > > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec > > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec > > > > The first run was 26k IOPS, the rest were at 35k IOPS as they > > overwrite the same blocks in the image file. IOWs, first pass at 75% > > of device capability, the rest at > 98% of the host measured device > > capability. All tests reported the full io depth was being used in > > the guest: > > > > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > > > > The guest OS measured about 30% CPU usage for a single fio run at > > 35k IOPS: > > > > real 0m22.648s > > user 0m1.678s > > sys 0m8.175s > > > > However, the QEMU process on the host required 4 entire CPUs to > > sustain this IO load, roughly 50/50 user/system time. IOWs, a large > > amount of the CPU overhead on such workloads is on the host side in > > QEMU, not the guest. > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
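[Editorial note] Liu Bo's question about whether qemu was bound to a single CPU can be checked, or enforced, from user space with the Linux scheduler-affinity calls; a minimal sketch (the choice of CPU is arbitrary for the demo):

```python
import os

# Save the mask we start with, pick one allowed CPU, and pin the calling
# process (pid 0 = self) to it -- the same effect the original benchmark
# achieved with cpu pinning for each qemu VM process.
original = os.sched_getaffinity(0)
cpu = min(original)

os.sched_setaffinity(0, {cpu})
print(os.sched_getaffinity(0) == {cpu})  # True

# Restore the original mask so nothing else is affected.
os.sched_setaffinity(0, original)
```

For an already-running qemu, the equivalent inspection is `taskset -pc <pid>`, as used in the setup described at the top of the thread.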
* Re: EXT4 vs LVM performance for VMs 2016-02-15 18:56 ` Premysl Kouril 2016-02-15 19:11 ` Liu Bo @ 2016-02-15 23:10 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2016-02-15 23:10 UTC (permalink / raw) To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
>
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results are almost the same as
> with EXT4. I again checked with my SystemTap script, and threads
> are mostly waiting for locks in the following places:
....
> Looking at this, does it suggest that the bottleneck is locking in the
> VFS layer? Or does my setup actually do direct I/O at the host level?
> You and Sanidhya mentioned that XFS is good at concurrent direct I/O as
> it doesn't hold a lock on the file, but I do see this in the trace:
>
> 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
> 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
> 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
> 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

You need to resolve these addresses to symbols so we can see where this is actually blocking.

> So either KVM is not doing direct I/O or there is some lock xfs must
> hold to do the write, right?

My guess is that you didn't configure KVM to use direct I/O correctly, because if it was XFS blocking on internal locks in direct IO it would be on an rwsem, not a mutex.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
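[Editorial note] Resolving the raw `[xfs]` addresses to symbols, as Dave asks for above, amounts to a nearest-lower-symbol lookup over a sorted symbol table. The table below is fabricated for illustration — on a live host the data would come from `/proc/kallsyms` (with kptr_restrict relaxed) or SystemTap's own symbol tables, and the real names for these addresses may differ:

```python
import bisect

# Fabricated kallsyms-style entries: (start address, symbol name).
# On a real system, parse /proc/kallsyms or the module's System.map.
symbols = sorted([
    (0xffffffffc073c000, "xfs_file_buffered_aio_write"),
    (0xffffffffc073d000, "xfs_file_write_iter"),
    (0xffffffffc07a7000, "xfs_ilock"),
])
addrs = [a for a, _ in symbols]

def resolve(addr):
    """Map an address to 'symbol+offset' via the nearest lower symbol."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return hex(addr)  # below every known symbol; leave unresolved
    base, name = symbols[i]
    return "%s+0x%x" % (name, addr - base)

# Resolve one of the unresolved frames from the trace above
# (against the fabricated table, so the name is illustrative only).
print(resolve(0xffffffffc073c531))  # xfs_file_buffered_aio_write+0x531
```

With the frames resolved this way, it becomes obvious whether the mutex in the trace is an XFS internal lock or something higher in the stack.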
end of thread, other threads:[~2016-02-15 23:10 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-11 20:50 EXT4 vs LVM performance for VMs Premysl Kouril
2016-02-12  6:12 ` Andi Kleen
2016-02-12  9:09 ` Premysl Kouril
2016-02-12 13:38 ` Theodore Ts'o
2016-02-12 14:13 ` Premysl Kouril
2016-02-12 16:53 ` Theodore Ts'o
2016-02-12 17:38 ` Premysl Kouril
2016-02-13  2:15 ` Dave Chinner
2016-02-13 21:56 ` Sanidhya Kashyap
2016-02-13 23:40 ` Jaegeuk Kim
2016-02-14  0:01 ` Dave Chinner
2016-02-15 18:56 ` Premysl Kouril
2016-02-15 19:11 ` Liu Bo
2016-02-15 23:10 ` Dave Chinner