Date: Mon, 15 Feb 2016 19:56:05 +0100
Subject: Re: EXT4 vs LVM performance for VMs
From: Premysl Kouril
To: Dave Chinner
Cc: "Theodore Ts'o", Andi Kleen, linux-fsdevel@vger.kernel.org

Hello Dave,

thanks for your suggestion. I've just recreated our tests with XFS and
preallocated raw files, and the results are almost the same as with ext4.
I again checked things with my SystemTap script, and the threads are
mostly waiting for locks in the following places:

The main KVM thread:

TID: 4532 waited 5135787 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
 0xffffffffc18a964b : 0xffffffffc18a964b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
 0xffffffffc18ab07a : 0xffffffffc18ab07a [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Worker threads (KVM worker or kernel worker):

TID: 12139 waited 7939986 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

TID: 12139 waited 11219902 ns here:
 0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
 0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
 0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Looking at this, does it suggest that the bottleneck is locking in the
VFS layer? Or does my setup actually do direct IO at the host level? You
and Sanidhya mentioned that XFS is good at concurrent direct IO because
it doesn't hold a lock on the file, but I do see this in the trace:

 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
 0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
 0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]

So either KVM is not doing direct IO, or there is some lock XFS must hold
to do the write, right?
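
To double-check whether QEMU really has the image file open with
O_DIRECT on the host side, I suppose something like the following should
tell (the process name and the grep pattern are just examples, not the
exact names from my setup):

  # find the fd QEMU has open on the raw image file
  ls -l /proc/$(pidof qemu-system-x86_64)/fd | grep img
  # inspect the open flags of that fd; with cache=none the octal
  # "flags:" value should include the O_DIRECT bit (040000 on x86_64)
  cat /proc/$(pidof qemu-system-x86_64)/fdinfo/<fd>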

Regards,

Premysl Kouril

On Sat, Feb 13, 2016 at 3:15 AM, Dave Chinner wrote:
> On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
>> > All of this being said, what are you trying to do? If you are happy
>> > using LVM, feel free to use it. If there are specific features that
>> > you want out of the file system, it's best that you explicitly
>> > identify what you want, so we can minimize the cost of the features
>> > you want.
>>
>> We are trying to decide whether to use a filesystem or LVM for VM
>> storage. It's not that we are happy with LVM - while it performs
>> better, there are limitations on the LVM side, especially when it
>> comes to manageability (for example, certain features in OpenStack
>> only work if the VM is file-based).
>>
>> So, in short, if we could make the filesystem perform better, we would
>> rather use a filesystem than LVM (and we don't really have any special
>> requirements in terms of filesystem features).
>>
>> And in order for us to make a good decision, I wanted to ask the
>> community whether our observations and resulting numbers make sense.
>
> For ext4, this is what you are going to get.
>
> How about you try XFS? After all, concurrent direct IO writes are
> something it is rather good at.
>
> i.e. use XFS in both your host and guest. Use raw image files on the
> host, and to make things roughly even with LVM you'll want to
> preallocate them. If you don't want to preallocate them (i.e. sparse
> image files), set them up with an extent size hint of at least 1MB to
> limit fragmentation of the image file. Then configure qemu to use
> cache=none for its IO to the image file.
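
(For the preallocated variant mentioned above, I assume the setup would
be roughly the following - the size and file name are only illustrative:

  # xfs_io -f -c "falloc 0 100g" vm-prealloc.img
  -drive file=/mnt/fast-ssd/vm-prealloc.img,if=virtio,cache=none,format=raw

i.e. the same qemu -drive line, just pointing at a fully preallocated
raw file.)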
>
> On the first write pass to the image file (in either case), you should
> see ~70-80% of the native underlying device performance, because there
> is some overhead in either allocation (sparse image file) or unwritten
> extent conversion (preallocated image file). This, of course, assumes
> you are not CPU limited in the QEMU process by the additional CPU
> overhead of file block mapping in the host filesystem vs raw block
> device IO.
>
> On the second write pass you should see 98-99% of the native underlying
> device performance (again assuming that the CPU overhead of the host
> filesystem isn't a limiting factor).
>
> As an example, I have a block device that can sustain just under 36k
> random 4k write IOPS on my host. I have an XFS filesystem (default
> configs) on that 400GB block device. I created a sparse 500TB image
> file using:
>
> # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
>
> and pushed it into a 16p/16GB RAM guest via:
>
> -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
>
> In the guest I ran mkfs.xfs with defaults and mounted with defaults,
> then ran your fio test on that 5 times in a row:
>
> write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
>
> The first run was 26k IOPS; the rest were at 35k IOPS as they overwrote
> the same blocks in the image file. IOWs, the first pass ran at 75% of
> device capability and the rest at >98% of the host-measured device
> capability. All tests reported that the full IO depth was being used in
> the guest:
>
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>
> The guest OS measured about 30% CPU usage for a single fio run at
> 35k IOPS:
>
> real    0m22.648s
> user    0m1.678s
> sys     0m8.175s
>
> However, the QEMU process on the host required 4 entire CPUs to sustain
> this IO load, roughly 50/50 user/system time. IOWs, a large amount of
> the CPU overhead on such workloads is on the host side in QEMU, not in
> the guest.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
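
The fio test referred to above is presumably a plain 4k random-write
job; its exact options aren't quoted in this message, so the invocation
below is only an assumed equivalent (the file name and iodepth in
particular are guesses):

  fio --name=randwrite --filename=/mnt/test/fio.file --rw=randwrite \
      --bs=4k --size=3g --ioengine=libaio --iodepth=64 --direct=1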