* EXT4 vs LVM performance for VMs
@ 2016-02-11 20:50 Premysl Kouril
  2016-02-12  6:12 ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-11 20:50 UTC (permalink / raw)
  To: linux-fsdevel

Hi All,

We are in the process of setting up a new cloud infrastructure and we
are deciding whether we should use file-backed virtual machines or
LVM-volume-backed virtual machines, and I would like to kindly ask the
community to confirm some of our performance-related findings and/or
advise whether there is something that can be done about them.

I have heard that the performance difference between LVM volumes and
files on a filesystem (when used for VM disks) is only about 1-5%, but
this is not what we are seeing.

Regarding our test hypervisor: it is filled with SSD disks, each
capable of up to 130 000 write IOPS, and it has plenty of CPU and RAM.

We test performance by running fio inside virtual machines (KVM
based) hosted on this hypervisor. In order to achieve comparable and
consistent benchmark results, the virtual machines are single-core VMs
and CPU hyperthreading is turned off on the hypervisor. Furthermore,
CPU cores are dedicated to the virtual machines using CPU pinning (so
a particular VM runs only on a particular CPU core).
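
The pinning step itself isn't shown here; as a minimal sketch with
libvirt (the domain name is the one from the qemu command lines later
in this thread, the host core number is just an example):

virsh vcpupin instance-00000327 0 2    # pin vCPU 0 to host core 2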

here is the fio command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=64 --size=3G --numjobs=1
--readwrite=randwrite

Kernel version: 3.16.0-60-generic

We mount the filesystem with:
mount -o noatime,data=writeback,barrier=0 /dev/md127 /var/lib/nova/instances/

We also disable journaling for testing purposes.
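
The journal removal itself isn't shown; a minimal sketch, assuming the
filesystem can be unmounted first:

umount /var/lib/nova/instances/
tune2fs -O ^has_journal /dev/md127    # drop the ext4 journal
e2fsck -f /dev/md127                  # recheck after the change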


Here are the results when we use an LVM volume as the virtual machine
disk versus a file (qcow2 or raw) stored on an EXT4 filesystem.

Test 1: Maximum sustainable write IOPS achieved on a single VM:

LVM: ~ 16 000 IOPS
EXT4: ~ 4 000 IOPS

Test 2: Maximum sustainable write IOPS achieved on the hypervisor by
running multiple test VMs:

LVM: ~ 40 000 IOPS (and then MDRAID5 hit 100% CPU utilization)
EXT4: ~ 20 000 IOPS

So basically LVM seems to perform much better.

Note that in the second test the RAID became the bottleneck, so it is
possible that the LVM layer would be capable of even more on a faster
RAID.


In Test 1:
  - on LVM we hit 100% utilization of the qemu VM process, where we
had: usr: 50%, sys: 50%, wait: 0%
  - on EXT4 we hit 100% utilization of the qemu VM process, where we
had: usr: 30%, sys: 30%, wait: 30%



So it seems that the performance of EXT4 is significantly lower, and
when using EXT4 we saw significant wait time. I tried to look into it
a bit (using a custom SystemTap script) and here is my observation.
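
The script itself isn't included; a minimal sketch of the kind of
off-CPU profiler that produces the output below (the probe points are
from the standard SystemTap scheduler tapset, the 1 ms threshold is
arbitrary):

stap -e '
global since
probe scheduler.cpu_off { since[tid()] = gettimeofday_ns() }
probe scheduler.cpu_on {
  t = since[tid()]
  if (t) {
    d = gettimeofday_ns() - t
    if (d > 1000000) {        # only report waits longer than 1 ms
      printf("TID: %d waited %d ns here:\n", tid(), d)
      print_backtrace()
    }
    delete since[tid()]
  }
}'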

When checking what is going on on the CPU which is executing the KVM
qemu process of the test VM, it seems to be executing 2 main threads
(these 2 threads are responsible for most of the time spent on CPU)
and about 60-70 other threads, which I assume are filesystem workers.

Of the 2 main threads, one seems OK and doesn't seem to be waiting
for a lock or anything - most of the time I see this thread leaving
the CPU it is due to a normal scheduler interrupt.


The other main thread is actually spending a lot of time waiting for
a lock; when this thread leaves the CPU it often does so here:

TID: 9838 waited 4916317 ns here:
 0xffffffffc1e4f12b : 0xffffffffc1e4f12b
[stap_b9c4a8366b974feec4893d4b5949417_17490+0x912b/0x0]
 0xffffffffc1e5061b : 0xffffffffc1e5061b
[stap_b9c4a8366b974feec4893d4b5949417_17490+0xa61b/0x0]
 0xffffffffc1e51e7a : 0xffffffffc1e51e7a
[stap_b9c4a8366b974feec4893d4b5949417_17490+0xbe7a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_b9c4a8366b974feec4893d4b5949417_17490+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]


When looking at the 60 worker threads, they often leave the CPU in
the following places and wait there for a significant amount of time:

First place:

TID: 57936 waited 2092552 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Second place:

TID: 57937 waited 1542013 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffff81250ca9 : ext4_file_write_iter+0x79/0x3a0 [kernel]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]


So basically we attribute the lower EXT4 performance to these points
where things need to be synchronized using locks, but this is just
what we see at a high level, so I would be curious whether the dev
community thinks this might be the cause.

All in all I'd like to ask the following questions:

1) Are the benchmark results as you would expect?
2) Can the lower performance be attributed to the locking?
3) Is there something we could do to improve performance of the filesystem?
4) Are there any plans for development in this area?

Regards,
Premysl Kouril


* Re: EXT4 vs LVM performance for VMs
  2016-02-11 20:50 EXT4 vs LVM performance for VMs Premysl Kouril
@ 2016-02-12  6:12 ` Andi Kleen
  2016-02-12  9:09   ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2016-02-12  6:12 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: linux-fsdevel

Premysl Kouril <premysl.kouril@gmail.com> writes:
>
> So basically we attribute the lower EXT4 performance to these points
> where things need to be synchronized using locks but this is just what
> we see at high level so I would be curious if dev community thinks
> this might be the cause.

Except for the last one, the backtraces you're showing are for futex
locks, which are not used by the kernel but by some user process. So
the locking problem is somewhere in the user space setup (perhaps in
qemu). This would indicate your ext4 setup is not the same as LVM.

The latter is the inode mutex, which is needed for POSIX semantics
to get atomic writes.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: EXT4 vs LVM performance for VMs
  2016-02-12  6:12 ` Andi Kleen
@ 2016-02-12  9:09   ` Premysl Kouril
  2016-02-12 13:38     ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12  9:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel

> Except for the last one, the backtraces you're showing are for futex
> locks, which are not used by the kernel but by some user process. So
> the locking problem is somewhere in the user space setup (perhaps in
> qemu). This would indicate your ext4 setup is not the same as LVM.

The setup on the qemu/kvm side is exactly the same (same test box,
same command line arguments except for the argument for the virtual
machine disk, which references either the LVM volume or the EXT4-based
file)



>
> The latter is the inode mutex, which is needed for POSIX semantics
> to get atomic writes.
>

Hmm, given that the user space thread is futexed in
0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel], isn't it really
that the user space thread is blocked as a result of lock contention
on the kernel inode mutex (i.e. 0xffffffff811d5251 :
new_sync_write+0x81/0xb0 [kernel]) and basically waits for the
filesystem to be done with the write?


* Re: EXT4 vs LVM performance for VMs
  2016-02-12  9:09   ` Premysl Kouril
@ 2016-02-12 13:38     ` Theodore Ts'o
  2016-02-12 14:13       ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2016-02-12 13:38 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 10:09:39AM +0100, Premysl Kouril wrote:
> > Except for the last one, the backtraces you're showing are for futex
> > locks, which are not used by the kernel but by some user process. So
> > the locking problem is somewhere in the user space setup (perhaps in
> > qemu). This would indicate your ext4 setup is not the same as LVM.
> 
> The setup on the qemu/kvm side is exactly the same (same test box,
> same command line arguments except for the argument for the virtual
> machine disk, which references either the LVM volume or the EXT4-based
> file)

You mentioned using qcow; if you're using qcow, then the userspace
qemu/kvm process will need to do its own locking to manage its own
space management inside the file.

In general, if you need to do allocation management, either because
you are writing to a sparse file, and ext4 has to do block allocation,
or if you are writing to a fallocated file, and ext4 has to keep track
of whether a block has been written to that location before (and if
not, do a journalled transaction to clear the unwritten bit), there
will be extra work that has to be done at the qcow or ext4 layer that
doesn't have to be done at the LVM layer.  Of course, this work is
providing extra services (such as space management and/or not
revealing previously written block contents from another customer to
your current customer's VM, which might make your local data
protection authorities cranky).

So the devil is very much in the details of how you set up the
hypervisor, and "except for the argument for the virtual machine disk"
is leaving an awful lot unspecified.  Whether you preallocated and
pre-zeroed the file can make a difference, whether you are using
buffered or direct I/O based on the cache parameter makes a
difference, etc., etc.
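
To make that concrete, here's a sketch of two ways of creating the
backing file that will behave differently on the first write pass
(sizes and paths are illustrative):

# fallocated: blocks reserved but marked unwritten; the first write to
# each block has to journal the clearing of the unwritten bit
fallocate -l 40G /var/lib/nova/instances/test/disk

# pre-zeroed: blocks allocated and written; the first write is a plain
# overwrite
dd if=/dev/zero of=/var/lib/nova/instances/test/disk bs=1M count=40960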

Cheers,

						- Ted


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 13:38     ` Theodore Ts'o
@ 2016-02-12 14:13       ` Premysl Kouril
  2016-02-12 16:53         ` Theodore Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12 14:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

>
> You mentioned using qcow; if you're using qcow, then the userspace
> qemu/kvm process will need to do its own locking to manage its own
> space management inside the file.
>

I tried raw files too; there is not much difference in performance.




> In general, if you need to do allocation management, either because
> you are writing to a sparse file, and ext4 has to do block allocation,
> or if you are writing to a fallocated file, and ext4 has to keep track
> of whether a block has been written to that location before (and if
> not, do a journalled transaction to clear the unwritten bit), there
> will be extra work that has to be done at the qcow or ext4 layer that
> doesn't have to be done at the LVM layer.  Of course, this work is
> providing extra services (such as space management and/or not
> revealing previously written block contents from another customer to
> your current customer's VM, which might make your local data
> protection authorities cranky).
>

The performance results which I posted in my original post are from
after all the block allocation is done. In other words: on the testing
machine we do a first fio run which writes and allocates the testing
file (during this first run the fio performance is actually much worse
than what I reported; single-VM performance is about 700 IOPS) and
then we do a second fio run, and we take the benchmark numbers from
this second run.


> So the devil is very much in the details of how you set up the
> hypervisor, and "except for the argument for the virtual machine disk"
> is leaving an awful lot unspecified.  Whether you preallocated and
> pre-zeroed the file can make a difference, whether you are using
> buffered or direct I/O based on the cache parameter makes a
> difference, etc., etc.


Here is the qemu command line when using raw files on EXT4:

/usr/bin/qemu-system-x86_64 -name instance-00000327 -S -machine
pc-i440fx-utopic,accel=kvm,usb=off -cpu
Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object
memory-backend-ram,size=2048M,id=ram-node0,host-nodes=1,policy=bind
-numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid
9426ece6-6c1f-403d-a7bf-fa8fb975b321 -smbios
type=1,manufacturer=OpenStack Foundation,product=OpenStack
Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=9426ece6-6c1f-403d-a7bf-fa8fb975b321
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000327.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=44 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:ea:40:eb,bus=pci.0,addr=0x3
-chardev file,id=charserial0,path=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:16 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

Here is the command line when using LVM:


/usr/bin/qemu-system-x86_64 -name instance-0000033b -S -machine
pc-i440fx-utopic,accel=kvm,usb=off -cpu
Haswell,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -object
memory-backend-ram,size=2048M,id=ram-node0,host-nodes=0,policy=bind
-numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid
d84d39c7-beba-4895-83ac-d2718fdee3f3 -smbios
type=1,manufacturer=OpenStack Foundation,product=OpenStack
Nova,version=2015.1.2,serial=31333937-3136-5a43-4a35-333130485632,uuid=d84d39c7-beba-4895-83ac-d2718fdee3f3
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0000033b.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:4c:1a:66,bus=pci.0,addr=0x3
-chardev file,id=charserial0,path=/var/lib/nova/instances/d84d39c7-beba-4895-83ac-d2718fdee3f3/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on



The virtual disk part for raw file:

   -drive file=/var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

The virtual disk part for LVM:

   -drive file=/dev/ssdGroup1/d84d39c7-beba-4895-83ac-d2718fdee3f3_disk,if=none,id=drive-virtio-disk0,format=raw,cache=none,iops_rd=20000,iops_wr=20000

Cheers,
Prema


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 14:13       ` Premysl Kouril
@ 2016-02-12 16:53         ` Theodore Ts'o
  2016-02-12 17:38           ` Premysl Kouril
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2016-02-12 16:53 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 03:13:26PM +0100, Premysl Kouril wrote:
> The performance results which I posted in my original post are from
> after all the block allocation is done. In other words: on the testing
> machine we do a first fio run which writes and allocates the testing
> file (during this first run the fio performance is actually much worse
> than what I reported; single-VM performance is about 700 IOPS) and
> then we do a second fio run, and we take the benchmark numbers from
> this second run.

So you allocated the file using a random write workload, which means
the file was probably not very contiguous, whereas when you allocated
the LVM volume, it was probably allocated contiguously. You can use
the filefrag tool to see how fragmented the file actually is.
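
For example (using the raw file path from the qemu command line
earlier in this thread):

filefrag -v /var/lib/nova/instances/9426ece6-6c1f-403d-a7bf-fa8fb975b321/disk

A badly fragmented file will report thousands of extents rather than a
handful.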

All of this being said, what are you trying to do?  If you are happy
using LVM, feel free to use it.  If there are specific features that
you want out of the file system, it's best that you explicitly
identify what you want, and so we can minimize the cost of the
features of what you want.

Cheers,

					- Ted


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 16:53         ` Theodore Ts'o
@ 2016-02-12 17:38           ` Premysl Kouril
  2016-02-13  2:15             ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Premysl Kouril @ 2016-02-12 17:38 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andi Kleen, linux-fsdevel

> All of this being said, what are you trying to do?  If you are happy
> using LVM, feel free to use it.  If there are specific features that
> you want out of the file system, it's best that you explicitly
> identify what you want, and so we can minimize the cost of the
> features of what you want.


We are trying to decide whether to use a filesystem or LVM for VM
storage. It's not that we are happy with LVM - while it performs
better, there are limitations on the LVM side, especially when it
comes to manageability (for example, certain features in OpenStack
only work if the VM is file-based).

So, in short, if we could make the filesystem perform better we would
rather use a filesystem than LVM (and we don't really have any special
requirements in terms of filesystem features).

And in order for us to make a good decision I wanted to ask the
community whether our observations and resulting numbers make sense.

Cheers,
Prema


* Re: EXT4 vs LVM performance for VMs
  2016-02-12 17:38           ` Premysl Kouril
@ 2016-02-13  2:15             ` Dave Chinner
  2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-15 18:56               ` Premysl Kouril
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-13  2:15 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> > All of this being said, what are you trying to do?  If you are happy
> > using LVM, feel free to use it.  If there are specific features that
> > you want out of the file system, it's best that you explicitly
> > identify what you want, and so we can minimize the cost of the
> > features of what you want.
> 
> 
> We are trying to decide whether to use a filesystem or LVM for VM
> storage. It's not that we are happy with LVM - while it performs
> better, there are limitations on the LVM side, especially when it
> comes to manageability (for example, certain features in OpenStack
> only work if the VM is file-based).
> 
> So, in short, if we could make the filesystem perform better we would
> rather use a filesystem than LVM (and we don't really have any special
> requirements in terms of filesystem features).
> 
> And in order for us to make a good decision I wanted to ask the
> community whether our observations and resulting numbers make sense.

For ext4, this is what you are going to get.

How about you try XFS? After all, concurrent direct IO writes is
something it is rather good at.

i.e. use XFS in both your host and guest. Use raw image files on the
host, and to make things roughly even with LVM you'll want to
preallocate them. If you don't want to preallocate them (i.e. sparse
image files) set them up with an extent size hint of at least 1MB so
that it limits fragmentation of the image file.  Then configure qemu
to use cache=none for its IO to the image file.
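
A sketch of the preallocated variant (file name and size illustrative;
compare the sparse-file example below):

# xfs_io -f -c "falloc 0 100g" /mnt/fast-ssd/vm-prealloc.img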

On the first write pass to the image file (in either case), you
should see ~70-80% of the native underlying device performance
because there is some overhead in either allocation (sparse image
file) or unwritten extent conversion (preallocated image file).
This, of course, assumes you are not CPU limited in the QEMU
process by the additional CPU overhead of file block mapping in the
host filesystem vs raw block device IO.

On the second write pass you should see 98-99% of the native
underlying device performance (again with the assumption that CPU
overhead of the host filesystem isn't a limiting factor).

As an example, I have a block device that can sustain just under 36k
random 4k write IOPS on my host. I have an XFS filesystem (default
configs) on that 400GB block device. I created a sparse 500TB image
file using:

# xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img

And push it into a 16p/16GB RAM guest via:

-drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw

and in the guest run mkfs.xfs with defaults and mount it with
defaults. Then I ran your fio test on that 5 times in a row:

write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec

The first run was 26k IOPS, the rest were at 35k IOPS as they
overwrite the same blocks in the image file. IOWs, first pass at 75%
of device capability, the rest at > 98% of the host measured device
capability. All tests reported the full io depth was being used in
the guest:

IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%

The guest OS measured about 30% CPU usage for a single fio run at
35k IOPS:

real    0m22.648s
user    0m1.678s
sys     0m8.175s

However, the QEMU process on the host required 4 entire CPUs to
sustain this IO load, roughly 50/50 user/system time. IOWs, a large
amount of the CPU overhead on such workloads is on the host side in
QEMU, not the guest.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: EXT4 vs LVM performance for VMs
  2016-02-13  2:15             ` Dave Chinner
@ 2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-13 23:40                 ` Jaegeuk Kim
  2016-02-14  0:01                 ` Dave Chinner
  2016-02-15 18:56               ` Premysl Kouril
  1 sibling, 2 replies; 14+ messages in thread
From: Sanidhya Kashyap @ 2016-02-13 21:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel,
	changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

We did a quite extensive performance evaluation of file systems,
including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
scalability, using micro-benchmarks and application benchmarks.

Your workload, i.e., multiple tasks concurrently overwriting a
single file whose file system blocks were previously written, is quite
similar to one of our benchmarks.

Based on our analysis, none of the file systems supports concurrent
updates of a file even when each task accesses a different region of
the file. That is because all file systems hold a lock for the entire
file. The only exception is the concurrent direct I/O of XFS.

I think that local file systems need to support range-based locking,
which is common in parallel file systems, to improve the concurrency
of I/O operations, specifically write operations.

If you can split a single file image into multiple files, you can
increase the concurrency level of write operations a little bit.

For more details, please take a look at our paper draft:
  https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf

Though our paper is in review, I think it is okay to share since the
review process is single-blind. You can find our analysis of overwrite
operations in Section 5.1.2. The scalability behavior of current file
systems is summarized in Section 7.


* Re: EXT4 vs LVM performance for VMs
  2016-02-13 21:56               ` Sanidhya Kashyap
@ 2016-02-13 23:40                 ` Jaegeuk Kim
  2016-02-14  0:01                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Jaegeuk Kim @ 2016-02-13 23:40 UTC (permalink / raw)
  To: Sanidhya Kashyap
  Cc: Dave Chinner, Premysl Kouril, Theodore Ts'o, Andi Kleen,
	linux-fsdevel, changwoo.m, taesoo, steffen.maass, changwoo,
	Kashyap, Sanidhya

Hi Sanidhya,

It's a very interesting paper to me. Thank you for sharing that.

Looking at it at a glance, I have a question about F2FS, where the
paper concludes that F2FS serializes every write.
I don't agree with that, since cp_rwsem, a rw_semaphore, is used to
stop all the fs operations only while performing a checkpoint.
Other than that case, every operation, including writes, just grabs
read_sem, so there should be no serialization.
It seems there is no sync/fsync contention in the workloads.

Thanks,


* Re: EXT4 vs LVM performance for VMs
  2016-02-13 21:56               ` Sanidhya Kashyap
  2016-02-13 23:40                 ` Jaegeuk Kim
@ 2016-02-14  0:01                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-14  0:01 UTC (permalink / raw)
  To: Sanidhya Kashyap
  Cc: Premysl Kouril, Theodore Ts'o, Andi Kleen, linux-fsdevel,
	changwoo.m, taesoo, steffen.maass, changwoo, Kashyap, Sanidhya

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did a quite extensive performance evaluation of file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability, using micro-benchmarks and application benchmarks.
> 
> Your workload, i.e., multiple tasks concurrently overwriting a
> single file whose file system blocks were previously written, is quite
> similar to one of our benchmarks.
> 
> Based on our analysis, none of the file systems supports concurrent
> updates of a file even when each task accesses a different region of
> the file. That is because all file systems hold a lock for the entire
> file. The only exception is the concurrent direct I/O of XFS.
> 
> I think that local file systems need to support range-based locking,
> which is common in parallel file systems, to improve the concurrency
> of I/O operations, specifically write operations.

Yes, we've spent a fair bit of time talking about that (pretty sure
it was a topic of discussion at last year's LSFMM developer
conference), but it really isn't a simple thing to add to the VFS or
most filesystems.

> If you can split a single file image into multiple files, you can
> increase the concurrency level of write operations a little bit.

At the cost of increased storage stack complexity. Most people don't
need extreme performance in their VMs, so a single file is generally
adequate on XFS.

> For more details, please take a look at our paper draft:
>   https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
> 
> Though our paper is in review, I think it is okay to share since
> the review process is single-blinded. You can find our analysis on
> overwrite operations at Section 5.1.2. Scalability behavior of current
> file systems are summarized at Section 7.

It's a nice summary of the issues, but there are no surprises in the
paper, i.e. it's all things we already know about and, in
some cases, are already looking at solutions for (e.g. per-node/per-cpu
lists to address inode_sb_list_lock contention, the potential for
converting i_mutex to an rwsem to allow shared read-only access to
directories, etc).

The only thing that surprised me is how badly rwsems degrade when
contended on large machines. I've done local benchmarks on 16p
machines with single file direct IO and, pushed to being CPU bound,
I've measured over 2 million single sector random read IOPS, 1.5
million random overwrite IOPS, and ~800k random write w/ allocate
IOPS.  IOWs, the IO scalability is there when the lock doesn't
degrade (which really is a core OS issue, not so much a fs issue).

A couple of things I noticed in the summary:

"High locality can cause performance collapse"

You imply filesystems try to maintain high locality to improve cache
hit rates.  Filesystems try to maintain locality in disk allocation
to minimise seek time for physical IO on related structures to
maintain good performance when /cache misses occur/. IOWs, the
scalability of the in-memory caches is completely unrelated to the
"high locality" optimisations that filesystem make...

"because XFS holds a per-device lock instead of a per-file lock in
an O_DIRECT mode"

That's a new one - I've never heard anyone say that about XFS (and
I've heard a lot of wacky things about XFS!). It's much simpler than
that - we don't use the i_mutex in O_DIRECT mode, and instead use
shared read locking on the per-inode IO lock for all IO operations.

"Overwriting is as expensive as appending"

You shouldn't make generalisations that don't apply generally to
the filesystems you tested. :P

FWIW, log->l_icloglock contention in XFS implies the application has
an excessive fsync problem - that's the only way that lock can see
any sort of significant concurrent access.  It's probably just the
case that the old-school algorithm the code uses to wait for journal
IO completion was never expected to scale to operations on storage
that can sustain millions of IOPS.

I'll add it to the list of known journalling scalability bottlenecks
in XFS - there's a lot more issues than your testing has told you
about.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: EXT4 vs LVM performance for VMs
  2016-02-13  2:15             ` Dave Chinner
  2016-02-13 21:56               ` Sanidhya Kashyap
@ 2016-02-15 18:56               ` Premysl Kouril
  2016-02-15 19:11                 ` Liu Bo
  2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 2 replies; 14+ messages in thread
From: Premysl Kouril @ 2016-02-15 18:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

Hello Dave,

thanks for your suggestion. I've just recreated our tests with XFS
and preallocated raw files, and the results seem almost the same as
with EXT4. I again checked things with my SystemTap script, and
threads are mostly waiting for locks in the following places:

The main KVM thread:


TID: 4532 waited 5135787 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
 0xffffffffc18a964b : 0xffffffffc18a964b
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
 0xffffffffc18ab07a : 0xffffffffc18ab07a
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014
[stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]



Worker threads (KVM worker or kernel worker):


TID: 12139 waited 7939986 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]



TID: 12139 waited 11219902 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014
[stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]




Looking at this, does it suggest that the bottleneck is locking in
the VFS layer? Or does my setup actually do direct IO at the host
level? You and Sanidhya mentioned that XFS is good at concurrent
direct IO as it doesn't hold a lock on the file, but I do see this in
the trace:

 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

So either KVM is not doing direct IO or there is some lock XFS must
hold to do the write, right?
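
One way I could check this from the host (fd number hypothetical; on
x86-64 O_DIRECT shows up as the octal 040000 bit in the flags line):

ls -l /proc/<qemu-pid>/fd | grep disk     # find the image file fd
cat /proc/<qemu-pid>/fdinfo/<fd>          # flags should include 040000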


Regards,
Premysl Kouril








* Re: EXT4 vs LVM performance for VMs
  2016-02-15 18:56               ` Premysl Kouril
@ 2016-02-15 19:11                 ` Liu Bo
  2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Liu Bo @ 2016-02-15 19:11 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Dave Chinner, Theodore Ts'o, Andi Kleen, linux-fsdevel

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results seem almost the same as
> with EXT4. I again checked things with my SystemTap script, and
> threads are mostly waiting for locks in the following places:
> 
....
> Looking at this, does it suggest that the bottleneck is locking in
> the VFS layer? Or does my setup actually do direct IO at the host
> level? You and Sanidhya mentioned that XFS is good at concurrent
> direct IO as it doesn't hold a lock on the file, but I do see this in
> the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
> 
> So either KVM is not doing direct IO or there is some lock XFS must
> hold to do the write, right?

Is this gathered while qemu is bound to a single CPU?

fio uses iodepth=64, but blk-mq uses a per-cpu or per-node queue.

Not sure if blk-mq is available on 3.16.0.

Thanks,

-liubo


* Re: EXT4 vs LVM performance for VMs
  2016-02-15 18:56               ` Premysl Kouril
  2016-02-15 19:11                 ` Liu Bo
@ 2016-02-15 23:10                 ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2016-02-15 23:10 UTC (permalink / raw)
  To: Premysl Kouril; +Cc: Theodore Ts'o, Andi Kleen, linux-fsdevel

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results seem almost the same as
> with EXT4. I again checked things with my SystemTap script, and
> threads are mostly waiting for locks in the following places:
....
> Looking at this, does it suggest that the bottleneck is locking in
> the VFS layer? Or does my setup actually do direct IO at the host
> level? You and Sanidhya mentioned that XFS is good at concurrent
> direct IO as it doesn't hold a lock on the file, but I do see this in
> the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

You need to resolve these addresses to symbols so we can see where
this is actually blocking.
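
For example, rerunning the systemtap script with module symbol data
loaded (the script name here is a placeholder):

stap -d xfs -d kvm offcpu.stp

lets print_backtrace() resolve the [xfs] and [kvm] addresses to
function names.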

> So either KVM is not doing direct IO or there is some lock XFS must
> hold to do the write, right?

My guess is that you didn't configure KVM to use direct IO
correctly, because if it were XFS blocking on internal locks in
direct IO it would be on an rwsem, not a mutex.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
