* EXT4 vs LVM performance for VMs
@ 2016-02-11 20:50 Premysl Kouril
  2016-02-12  6:12 ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-11 20:50 UTC (permalink / raw)
  To: linux-fsdevel

Hi All,

We are in the process of setting up a new cloud infrastructure and we
are deciding whether we should use file-backed virtual machines or
LVM-volume-backed virtual machines, and I would like to kindly ask the
community to confirm some of our performance-related findings and/or
advise whether there is something that can be done about them.

I have heard that the performance difference between LVM volumes and
files on a filesystem (when used for VM disks) is only about 1-5%, but
this is not what we are seeing.

Regarding our test hypervisor: it is filled with SSD disks, each
capable of up to 130 000 write IOPS, and it has plenty of CPU and RAM.

We test performance by running fio inside virtual machines (KVM
based) hosted on this hypervisor. In order to achieve comparable and
consistent benchmark results, the virtual machines are single-core VMs
and CPU hyperthreading is turned off on the hypervisor. Furthermore,
CPU cores are dedicated to the virtual machines using CPU pinning (so
a particular VM runs only on a particular CPU core).
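
The pinning step itself isn't shown here; as a minimal sketch with
libvirt (the domain name is the one from the qemu command lines later
in this thread, the host core number is just an example):

virsh vcpupin instance-00000327 0 2    # pin vCPU 0 to host core 2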

here is the fio command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=64 --size=3G --numjobs=1
--readwrite=randwrite

Kernel version: 3.16.0-60-generic

We mount the filesystem with:
mount -o noatime,data=writeback,barrier=0 /dev/md127 /var/lib/nova/instances/

We also disable journaling for testing purposes.
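
The journal removal itself isn't shown; a minimal sketch, assuming the
filesystem can be unmounted first:

umount /var/lib/nova/instances/
tune2fs -O ^has_journal /dev/md127    # drop the ext4 journal
e2fsck -f /dev/md127                  # recheck after the change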


Here are the results when we use an LVM volume as the virtual machine
disk versus a file (qcow2 or raw) stored on an EXT4 filesystem.

Test 1: Maximum sustainable write IOPS achieved on a single VM:

LVM: ~ 16 000 IOPS
EXT4: ~ 4 000 IOPS

Test 2: Maximum sustainable write IOPS achieved on the hypervisor by
running multiple test VMs:

LVM: ~ 40 000 IOPS (and then MDRAID5 hit 100% CPU utilization)
EXT4: ~ 20 000 IOPS

So basically LVM seems to perform much better.

Note that in the second test the RAID became the bottleneck, so it is
possible that the LVM layer would be capable of even more on a faster
RAID.


In Test 1:
  - on LVM we hit 100% utilization of the qemu VM process, where we
had: usr: 50%, sys: 50%, wait: 0%
  - on EXT4 we hit 100% utilization of the qemu VM process, where we
had: usr: 30%, sys: 30%, wait: 30%



So it seems that the performance of EXT4 is significantly lower, and
when using EXT4 we saw significant wait time. I tried to look into it
a bit (using a custom SystemTap script) and here is my observation.
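
The script itself isn't included; a minimal sketch of the kind of
off-CPU profiler that produces the output below (the probe points are
from the standard SystemTap scheduler tapset, the 1 ms threshold is
arbitrary):

stap -e '
global since
probe scheduler.cpu_off { since[tid()] = gettimeofday_ns() }
probe scheduler.cpu_on {
  t = since[tid()]
  if (t) {
    d = gettimeofday_ns() - t
    if (d > 1000000) {        # only report waits longer than 1 ms
      printf("TID: %d waited %d ns here:\n", tid(), d)
      print_backtrace()
    }
    delete since[tid()]
  }
}'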

When checking what is going on on the CPU which is executing the KVM
qemu process of the test VM, it seems to be executing 2 main threads
(these 2 threads are responsible for most of the time spent on CPU)
and about 60-70 other threads, which I assume are filesystem workers.

Of the 2 main threads, one seems OK and doesn't seem to be waiting
for a lock or anything - most of the time I see this thread leaving
the CPU it is due to a normal scheduler interrupt.


The other main thread is actually spending a lot of time waiting for
a lock; when this thread leaves the CPU it often does so here:

TID: 9838 waited 4916317 ns here:
 0xffffffffc1e4f12b : 0xffffffffc1e4f12b
[stap_b9c4a8366b974feec4893d4b5949417_17490+0x912b/0x0]
 0xffffffffc1e5061b : 0xffffffffc1e5061b
[stap_b9c4a8366b974feec4893d4b5949417_17490+0xa61b/0x0]
 0xffffffffc1e51e7a : 0xffffffffc1e51e7a
[stap_b9c4a8366b974feec4893d4b5949417_17490+0xbe7a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_b9c4a8366b974feec4893d4b5949417_17490+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]


When looking at the 60 worker threads, they often leave the CPU in
the following places and wait there for a significant amount of time:

First place:

TID: 57936 waited 2092552 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Second place:

TID: 57937 waited 1542013 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffff81250ca9 : ext4_file_write_iter+0x79/0x3a0 [kernel]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]


So basically we attribute the lower EXT4 performance to these points
where things need to be synchronized using locks, but this is just
what we see at a high level, so I would be curious whether the dev
community thinks this might be the cause.

All in all I'd like to ask the following questions:

1) Are the benchmark results as you would expect?
2) Can the lower performance be attributed to the locking?
3) Is there something we could do to improve performance of the filesystem?
4) Are there any plans for development in this area?

Regards,
Premysl Kouril


* Re: EXT4 vs LVM performance for VMs
  2016-02-11 20:50 EXT4 vs LVM performance for VMs Premysl Kouril
@ 2016-02-12  6:12 ` Andi Kleen
  2016-02-12  9:09   ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2016-02-12  6:12 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: linux-fsdevel

Premysl Kouril <premysl.kouril@gmail.com> writes:
>
> So basically we attribute the lower EXT4 performance to these points
> where things need to be synchronized using locks but this is just what
> we see at high level so I would be curious if dev community thinks
> this might be the cause.

Except for the last one, the backtraces you're showing are for futex
locks, which are not used by the kernel but by some user process. So
the locking problem is somewhere in the user space setup (perhaps in
qemu). This would indicate your ext4 setup is not the same as LVM.

The latter is the inode mutex, which is needed for POSIX semantics
to get atomic writes.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: EXT4 vs LVM performance for VMs
  2016-02-12  6:12 ` Andi Kleen
@ 2016-02-12  9:09   ` Premysl Kouril
  2016-02-12 13:38     ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12  9:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel

> Except for the last one, the backtraces you're showing are for futex
> locks, which are not used by the kernel but by some user process. So
> the locking problem is somewhere in the user space setup (perhaps in
> qemu). This would indicate your ext4 setup is not the same as LVM.

The setup on the qemu/kvm side is exactly the same (same test box,
same command line arguments except for the argument for the virtual
machine disk, which references either the LVM volume or the EXT4-based
file)



>
> The latter is the inode mutex, which is needed for POSIX semantics
> to get atomic writes.
>

Hmm, given that the user space thread is futexed in
0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel], isn't it really
that the user space thread is blocked as a result of lock contention
on the kernel inode mutex (i.e. 0xffffffff811d5251 :
new_sync_write+0x81/0xb0 [kernel]) and basically waits for the
filesystem to be done with the write?


* Re: EXT4 vs LVM performance for VMs
  2016-02-12  9:09   ` Premysl Kouril
@ 2016-02-12 13:38     ` Theodore Ts'o
  2016-02-12 14:13       ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2016-02-12 13:38 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 10:09:39AM +0100, Premysl Kouril wrote:
> > Except for the last one, the backtraces you're showing are for futex
> > locks, which are not used by the kernel but by some user process. So
> > the locking problem is somewhere in the user space setup (perhaps in
> > qemu). This would indicate your ext4 setup is not the same as LVM.
> 
> The setup on the qemu/kvm side is exactly the same (same test box,
> same command line arguments except for the argument for the virtual
> machine disk, which references either the LVM volume or the EXT4-based
> file)

You mentioned using qcow; if you're using qcow, then the userspace
qemu/kvm process will need to do its own locking to manage its own
space management inside the file.

In general, if you need to do allocation management, either because
you are writing to a sparse file, and ext4 has to do block allocation,
or if you are writing to a fallocated file, and ext4 has to keep track
of whether a block has been written to that location before (and if
not, do a journalled transaction to clear the unwritten bit), there
will be extra work that has to be done at the qcow or ext4 layer that
doesn't have to be done at the LVM layer.  Of course, this work is
providing extra services (such as space management and/or not
revealing previously written block contents from another customer to
your current customer's VM, which might make your local data
protection authorities cranky).

So the devil is very much in the details of how you set up the
hypervisor, and "except for the argument for the virtual machine disk"
is leaving an awful lot unspecified.  Whether you preallocated and
pre-zeroed the file can make a difference, whether you are using
buffered or direct I/O based on the cache parameter makes a
difference, etc., etc.
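
To make that concrete, here's a sketch of two ways of creating the
backing file that will behave differently on the first write pass
(sizes and paths are illustrative):

# fallocated: blocks reserved but marked unwritten; the first write to
# each block has to journal the clearing of the unwritten bit
fallocate -l 40G /var/lib/nova/instances/test/disk

# pre-zeroed: blocks allocated and written; the first write is a plain
# overwrite
dd if=/dev/zero of=/var/lib/nova/instances/test/disk bs=1M count=40960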

Cheers,

						- Ted


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 13:38     ` Theodore Ts'o
@ 2016-02-12 14:13       ` Premysl Kouril
  2016-02-12 16:53         ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12 14:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

>
> You mentioned using qcow; if you're using qcow, then the userspace
> qemu/kvm process will need to do its own locking to manage its own
> space management inside the file.
>

I tried raw files too; there is not much difference in performance.




> In general, if you need to do allocation management, either because
> you are writing to a sparse file, and ext4 has to do block allocation,
> or if you are writing to a fallocated file, and ext4 has to keep track
> of whether a block has been written to that location before (and if
> not, do a journalled transaction to clear the unwritten bit), there
> will be extra work that has to be done at the qcow or ext4 layer that
> doesn't have to be done at the LVM layer.  Of course, this work is
> providing extra services (such as space management and/or not
> revealing previously written block contents from another customer to
> your current customer's VM, which might make your local data
> protection authorities cranky).
>

The performance results which I posted in my original post are from
after all the block allocation is done. In other words: on the testing
machine we do a first fio run which writes and allocates the testing
file (during this first run the fio performance is actually much worse
than what I reported; single-VM performance is about 700 IOPS) and
then we do a second fio run, and we take the benchmark numbers from
this second run.


> So the devil is very much in the details of how you set up the
> hypervisor, and "except for the argument for the virtual machine disk"
> is leaving an awful lot unspecified.  Whether you preallocated and
> pre-zeroed the file can make a difference, whether you are using
> buffered or direct I/O based on the cache parameter makes a
> difference, etc., etc.


Here is the qemu command line when using raw files on EXT4:

/usr/bin/qemu-system-x86_64 -name instance-00000327 -S -machine
pc-i440fx-utopic,accel=kvm,usb=off -cpu
Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object
memory-backend-ram,size=2048M,id=ram-node0,host-nodes=1,policy=bind
-numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid
9426ece6-6c1f-403d-a7bf-fa8fb975b321 -smbios
type=1,manufacturer=OpenStack Foundation,product=OpenStack
Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=9426ece6-6c1f-403d-a7bf-fa8fb975b321
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000327.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=44 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:ea:40:eb,bus=pci.0,addr=0x3
-chardev file,id=charserial0,path=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:16 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

Here is the command line when using LVM:


/usr/bin/qemu-system-x86_64 -name instance-0000033b -S -machine
pc-i440fx-utopic,accel=kvm,usb=off -cpu
Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object
memory-backend-ram,size=2048M,id=ram-node0,host-nodes=0,policy=bind
-numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid
d84d39c7-beba-4895-83ac-d2718fdee3f3 -smbios
type=1,manufacturer=OpenStack Foundation,product=OpenStack
Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=d84d39c7-beba-4895-83ac-d2718fdee3f3
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0000033b.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:4c:1a:66,bus=pci.0,addr=0x3
-chardev file,id=charserial0,path=/var/lib/nova/instances/d84d39c7-beba-4895-83ac-d2718fdee3f3/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on



The virtual disk part for raw file:

   -drive file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

The virtual disk part for LVM:

   -drive file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

Cheers,
Prema


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 14:13       ` Premysl Kouril
@ 2016-02-12 16:53         ` Theodore Ts'o
  2016-02-12 17:38           ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2016-02-12 16:53 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 03:13:26PM +0100, Premysl Kouril wrote:
> The performance results which I posted in my original post are from
> after all the block allocation is done. In other words: on the testing
> machine we do a first fio run which writes and allocates the testing
> file (during this first run the fio performance is actually much worse
> than what I reported; single-VM performance is about 700 IOPS) and
> then we do a second fio run, and we take the benchmark numbers from
> this second run.

So you allocated the file using a random write workload, which means
the file was probably not very contiguous, whereas when you allocated
the LVM volume, it was probably allocated contiguously. You can use
the filefrag tool to see how fragmented the file actually is.
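
For example (using the raw file path from the qemu command line
earlier in this thread):

filefrag -v /var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk

A badly fragmented file will report thousands of extents rather than a
handful.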

All of this being said, what are you trying to do?  If you are happy
using LVM, feel free to use it.  If there are specific features that
you want out of the file system, it's best that you explicitly
identify what you want, and so we can minimize the cost of the
features of what you want.

Cheers,

					- Ted


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 16:53         ` Theodore Ts'o
@ 2016-02-12 17:38           ` Premysl Kouril
  2016-02-13  2:15             ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12 17:38 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

> All of this being said, what are you trying to do?  If you are happy
> using LVM, feel free to use it.  If there are specific features that
> you want out of the file system, it's best that you explicitly
> identify what you want, and so we can minimize the cost of the
> features of what you want.


We are trying to decide whether to use a filesystem or LVM for VM
storage. It's not that we are happy with LVM - while it performs
better, there are limitations on the LVM side, especially when it
comes to manageability (for example, certain features in OpenStack
only work if the VM is file-based).

So, in short, if we could make the filesystem perform better we would
rather use a filesystem than LVM (and we don't really have any special
requirements in terms of filesystem features).

And in order for us to make a good decision I wanted to ask the
community whether our observations and resulting numbers make sense.

Cheers,
Prema


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 17:38           ` Premysl Kouril
@ 2016-02-13  2:15             ` Dave Chinner
  2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-15 18:56               ` Premysl Kouril
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-13  2:15 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> > All of this being said, what are you trying to do?  If you are happy
> > using LVM, feel free to use it.  If there are specific features that
> > you want out of the file system, it's best that you explicitly
> > identify what you want, and so we can minimize the cost of the
> > features of what you want.
> 
> 
> We are trying to decide whether to use a filesystem or LVM for VM
> storage. It's not that we are happy with LVM - while it performs
> better, there are limitations on the LVM side, especially when it
> comes to manageability (for example, certain features in OpenStack
> only work if the VM is file-based).
> 
> So, in short, if we could make the filesystem perform better we would
> rather use a filesystem than LVM (and we don't really have any special
> requirements in terms of filesystem features).
> 
> And in order for us to make a good decision I wanted to ask the
> community whether our observations and resulting numbers make sense.

For ext4, this is what you are going to get.

How about you try XFS? After all, concurrent direct IO writes is
something it is rather good at.

i.e. use XFS in both your host and guest. Use raw image files on the
host, and to make things roughly even with LVM you'll want to
preallocate them. If you don't want to preallocate them (i.e. sparse
image files) set them up with an extent size hint of at least 1MB so
that it limits fragmentation of the image file.  Then configure qemu
to use cache=none for its IO to the image file.
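
A sketch of the preallocated variant (file name and size illustrative;
compare the sparse-file example below):

# xfs_io -f -c "falloc 0 100g" /mnt/fast-ssd/vm-prealloc.img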

On the first write pass to the image file (in either case), you
should see ~70-80% of the native underlying device performance
because there is some overhead in either allocation (sparse image
file) or unwritten extent conversion (preallocated image file).
This, of course, assumes you are not CPU limited in the QEMU
process by the additional CPU overhead of file block mapping in the
host filesystem vs raw block device IO.

On the second write pass you should see 98-99% of the native
underlying device performance (again with the assumption that CPU
overhead of the host filesystem isn't a limiting factor).

As an example, I have a block device that can sustain just under 36k
random 4k write IOPS on my host. I have an XFS filesystem (default
configs) on that 400GB block device. I created a sparse 500TB image
file using:

# xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img

And push it into a 16p/16GB RAM guest via:

-drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw

and in the guest run mkfs.xfs with defaults and mount it with
defaults. Then I ran your fio test on that 5 times in a row:

write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec

The first run was 26k IOPS, the rest were at 35k IOPS as they
overwrite the same blocks in the image file. IOWs, first pass at 75%
of device capability, the rest at > 98% of the host measured device
capability. All tests reported the full io depth was being used in
the guest:

IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%

The guest OS measured about 30% CPU usage for a single fio run at
35k IOPS:

real    0m22.648s
user    0m1.678s
sys     0m8.175s

However, the QEMU process on the host required 4 entire CPUs to
sustain this IO load, roughly 50/50 user/system time. IOWs, a large
amount of the CPU overhead on such workloads is on the host side in
QEMU, not the guest.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: EXT4 vs LVM performance for VMs
  2016-02-13  2:15             ` Dave Chinner
@ 2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-13 23:40                 ` Jaegeuk Kim
  2016-02-14  0:01                 ` Dave Chinner
  2016-02-15 18:56               ` Premysl Kouril
  1 sibling, 2 replies; 14+ messages in thread
From: Sanidhya Kashyap @ 2016-02-13 21:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel,
	changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

We did a quite extensive performance evaluation of file systems,
including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
scalability, using micro-benchmarks and application benchmarks.

Your workload, i.e., multiple tasks concurrently overwriting a
single file whose file system blocks were previously written, is quite
similar to one of our benchmarks.

Based on our analysis, none of the file systems supports concurrent
updates of a file even when each task accesses a different region of
the file. That is because all file systems hold a lock for the entire
file. The only exception is the concurrent direct I/O of XFS.

I think that local file systems need to support range-based locking,
which is common in parallel file systems, to improve the concurrency
of I/O operations, specifically write operations.

If you can split a single file image into multiple files, you can
increase the concurrency level of write operations a little bit.

For more details, please take a look at our paper draft:
  https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf

Though our paper is in review, I think it is okay to share since the
review process is single-blind. You can find our analysis of overwrite
operations in Section 5.1.2. The scalability behavior of current file
systems is summarized in Section 7.


* Re: EXT4 vs LVM performance for VMs
  2016-02-13 21:56               ` Sanidhya Kashyap
@ 2016-02-13 23:40                 ` Jaegeuk Kim
  2016-02-14  0:01                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Jaegeuk Kim @ 2016-02-13 23:40 UTC (permalink / raw)
  To: Sanidhya Kashyap
  Cc: Dave Chinner, Premysl Kouril, Theodore Ts'o, Andi Kleen,
	linux-fsdevel, changwoo.m, taesoo, steffen.maass, changwoo,
	Kashyap, Sanidhya

Hi Sanidhya,

It's a very interesting paper to me. Thank you for sharing that.

Looking at it at a glance, I have a question about F2FS, where the
paper concludes that F2FS serializes every write.
I don't agree with that, since cp_rwsem, a rw_semaphore, is used to
stop all the fs operations only while performing a checkpoint.
Other than that case, every operation, including writes, just grabs
read_sem, so there should be no serialization.
It seems there is no sync/fsync contention in the workloads.

Thanks,


* Re: EXT4 vs LVM performance for VMs
  2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-13 23:40                 ` Jaegeuk Kim
@ 2016-02-14  0:01                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-14  0:01 UTC (permalink / raw)
  To: Sanidhya Kashyap
  Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel,
	changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did a quite extensive performance evaluation of file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability, using micro-benchmarks and application benchmarks.
> 
> Your workload, i.e., multiple tasks concurrently overwriting a
> single file whose file system blocks were previously written, is quite
> similar to one of our benchmarks.
> 
> Based on our analysis, none of the file systems supports concurrent
> updates of a file even when each task accesses a different region of
> the file. That is because all file systems hold a lock for the entire
> file. The only exception is the concurrent direct I/O of XFS.
> 
> I think that local file systems need to support range-based locking,
> which is common in parallel file systems, to improve the concurrency
> of I/O operations, specifically write operations.

Yes, we've spent a fair bit of time talking about that (pretty sure
it was a topic of discussion at last year's LSFMM developer
conference), but it really isn't a simple thing to add to the VFS or
most filesystems.

> If you can split a single file image into multiple files, you can
> increase the concurrency level of write operations a little bit.

At the cost of increased storage stack complexity. Most people don't
need extreme performance in their VMs, so a single file is generally
adequate on XFS.

> For more details, please take a look at our paper draft:
>   https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
> 
> Though our paper is in review, I think it is okay to share since
> the review process is single-blinded. You can find our analysis on
> overwrite operations at Section 5.1.2. Scalability behavior of current
> file systems are summarized at Section 7.

It's a nice summary of the issues, but there are no surprises in the
paper, i.e. it's all things we already know about and, in
some cases, are already looking at solutions for (e.g. per-node/per-cpu
lists to address inode_sb_list_lock contention, the potential for
converting i_mutex to an rwsem to allow shared read-only access to
directories, etc).

The only thing that surprised me is how badly rwsems degrade when
contended on large machines. I've done local benchmarks on 16p
machines with single file direct IO and, pushed to being CPU bound,
I've measured over 2 million single sector random read IOPS, 1.5
million random overwrite IOPS, and ~800k random write w/ allocate
IOPS.  IOWs, the IO scalability is there when the lock doesn't
degrade (which really is a core OS issue, not so much a fs issue).

A couple of things I noticed in the summary:

"High locality can cause performance collapse"

You imply filesystems try to maintain high locality to improve cache
hit rates.  Filesystems try to maintain locality in disk allocation
to minimise seek time for physical IO on related structures to
maintain good performance when /cache misses occur/. IOWs, the
scalability of the in-memory caches is completely unrelated to the
"high locality" optimisations that filesystem make...

"because XFS holds a per-device lock instead of a per-file lock in
an O_DIRECT mode"

That's a new one - I've never heard anyone say that about XFS (and
I've heard a lot of wacky things about XFS!). It's much simpler than
that - we don't use the i_mutex in O_DIRECT mode, and instead use
shared read locking on the per-inode IO lock for all IO operations.

"Overwriting is as expensive as appending"

You shouldn't make generalisations that don't apply generally to
the filesystems you tested. :P

FWIW, log->l_icloglock contention in XFS implies the application has
an excessive fsync problem - that's the only way that lock can see
any sort of significant concurrent access.  It's probably just the
case that the old-school algorithm the code uses to wait for journal
IO completion was never expected to scale to operations on storage
that can sustain millions of IOPS.

I'll add it to the list of known journalling scalability bottlenecks
in XFS - there's a lot more issues than your testing has told you
about.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: EXT4 vs LVM performance for VMs
  2016-02-13  2:15             ` Dave Chinner
  2016-02-13 21:56               ` Sanidhya Kashyap
@ 2016-02-15 18:56               ` Premysl Kouril
  2016-02-15 19:11                 ` Liu Bo
  2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 2 replies; 14+ messages in thread
From: Premysl Kouril @ 2016-02-15 18:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

Hello Dave,

thanks for your suggestion. I've just recreated our tests with XFS
and preallocated raw files, and the results seem almost the same as
with EXT4. I again checked things with my SystemTap script, and
threads are mostly waiting for locks in the following places:

The main KVM thread:


TID: 4532 waited 5135787 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
 0xffffffffc18a964b : 0xffffffffc18a964b
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
 0xffffffffc18ab07a : 0xffffffffc18ab07a
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]



Worker threads (KVM worker or kernel worker):


TID: 12139 waited 7939986 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]



TID: 12139 waited 11219902 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]




Looking at this, does it suggest that the bottleneck is locking in
the VFS layer? Or does my setup actually do direct IO at the host
level? You and Sanidhya mentioned that XFS is good at concurrent
direct IO as it doesn't hold a lock on the file, but I do see this in
the trace:

 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

So either KVM is not doing direct IO or there is some lock XFS must
hold to do the write, right?
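
One way I could check this from the host (fd number hypothetical; on
x86-64 O_DIRECT shows up as the octal 040000 bit in the flags line):

ls -l /proc/<qemu-pid>/fd | grep disk     # find the image file fd
cat /proc/<qemu-pid>/fdinfo/<fd>          # flags should include 040000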


Regards,
Premysl Kouril








* Re: EXT4 vs LVM performance for VMs
  2016-02-15 18:56               ` Premysl Kouril
@ 2016-02-15 19:11                 ` Liu Bo
  2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Liu Bo @ 2016-02-15 19:11 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Dave Chinner, Theodore Ts'o, Andi Kleen, linux-fsdevel

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results seem almost the same as
> with EXT4. I again checked things with my SystemTap script, and
> threads are mostly waiting for locks in the following places:
> 
....
> Looking at this, does it suggest that the bottleneck is locking in
> the VFS layer? Or does my setup actually do direct IO at the host
> level? You and Sanidhya mentioned that XFS is good at concurrent
> direct IO as it doesn't hold a lock on the file, but I do see this in
> the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
> 
> So either KVM is not doing direct IO or there is some lock XFS must
> hold to do the write, right?

Is this gathered while qemu is bound to a single CPU?

fio uses iodepth=64, but blk-mq uses a per-cpu or per-node queue.

Not sure if blk-mq is available on 3.16.0.

Thanks,

-liubo


* Re: EXT4 vs LVM performance for VMs
  2016-02-15 18:56               ` Premysl Kouril
  2016-02-15 19:11                 ` Liu Bo
@ 2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-15 23:10 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results seem almost the same as
> with EXT4. I again checked things with my SystemTap script, and
> threads are mostly waiting for locks in the following places:
....
> Looking at this, does it suggest that the bottleneck is locking in
> the VFS layer? Or does my setup actually do direct IO at the host
> level? You and Sanidhya mentioned that XFS is good at concurrent
> direct IO as it doesn't hold a lock on the file, but I do see this in
> the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

You need to resolve these addresses to symbols so we can see where
this is actually blocking.
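
For example, rerunning the systemtap script with module symbol data
loaded (the script name here is a placeholder):

stap -d xfs -d kvm offcpu.stp

lets print_backtrace() resolve the [xfs] and [kvm] addresses to
function names.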

> So either KVM is not doing direct IO or there is some lock XFS must
> hold to do the write, right?

My guess is that you didn't configure KVM to use direct IO
correctly, because if it were XFS blocking on internal locks in
direct IO it would be on an rwsem, not a mutex.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
