From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from aserp1040.oracle.com ([141.146.126.69]:29179 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752997AbcBOTLi (ORCPT ); Mon, 15 Feb 2016 14:11:38 -0500
Date: Mon, 15 Feb 2016 11:11:10 -0800
From: Liu Bo 
To: Premysl Kouril 
Cc: Dave Chinner , "Theodore Ts'o" , Andi Kleen , linux-fsdevel@vger.kernel.org
Subject: Re: EXT4 vs LVM performance for VMs
Message-ID: <20160215191110.GC23230@localhost.localdomain>
Reply-To: bo.li.liu@oracle.com
References: <87twlee9to.fsf@tassilo.jf.intel.com> <20160212133825.GJ11298@thunk.org> <20160212165319.GB7928@thunk.org> <20160213021509.GO19486@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: 

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS and
> preallocated raw files, and the results seem almost the same as with
> EXT4. I again checked things with my SystemTap script; the threads are
> mostly waiting for locks in the following places:
> 
> The main KVM thread:
> 
> TID: 4532 waited 5135787 ns here:
>  0xffffffffc18a815b : 0xffffffffc18a815b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
>  0xffffffffc18a964b : 0xffffffffc18a964b [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
>  0xffffffffc18ab07a : 0xffffffffc18ab07a [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
>  0xffffffffc189f014 : 0xffffffffc189f014 [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
>  0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
>  0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
>  0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
>  0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
>  0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
>  0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
>  0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
>  0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
>  0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> Worker threads (KVM worker or kernel worker):
> 
> TID: 12139 waited 7939986 here:
>  0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
>  0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
>  0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
>  0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
>  0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
>  0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
>  0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
>  0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> TID: 12139 waited 11219902 here:
>  0xffffffffc1e4f15b : 0xffffffffc1e4f15b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
>  0xffffffffc1e5065b : 0xffffffffc1e5065b [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
>  0xffffffffc1e5209a : 0xffffffffc1e5209a [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
>  0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
>  0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
>  0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
>  0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
>  0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> Looking at this, does it suggest that the bottleneck is locking in the
> VFS layer? Or does my setup actually do direct IO at the host level?
> You and Sanidhya mentioned that XFS is good at concurrent direct IO as
> it doesn't hold a lock on the file, but I do see this in the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
> 
> So either KVM is not doing direct IO, or there is some lock XFS must
> hold to do the write, right?

Is this gathered when qemu is bound to a single CPU?

fio uses iodepth=64, but blk-mq uses per-cpu or per-node queues.

Not sure if blk-mq is available on 3.16.0.
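Something like the following should answer all three questions (whether
blk-mq is in use, whether qemu is pinned to one CPU, and whether the
writes are really direct IO at the host level). I'm guessing the guest
disk shows up as vda and that the host runs the usual qemu-system-x86_64
binary; adjust the names to your setup:

  # in the guest: the mq/ directory only exists when the disk uses blk-mq
  ls /sys/block/vda/mq

  # on the host: show the CPU affinity of qemu and each of its worker threads
  taskset -acp $(pidof qemu-system-x86_64)

  # on the host: find the fd qemu has open on the image file, then check its
  # flags; the value is octal and includes 040000 (O_DIRECT) when cache=none
  # really results in direct IO
  ls -l /proc/$(pidof qemu-system-x86_64)/fd
  cat /proc/$(pidof qemu-system-x86_64)/fdinfo/<fd of the image file>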
Thanks,

-liubo

> 
> Regards,
> Premysl Kouril
> 
> 
> On Sat, Feb 13, 2016 at 3:15 AM, Dave Chinner wrote:
> > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> >> > All of this being said, what are you trying to do?  If you are happy
> >> > using LVM, feel free to use it.  If there are specific features that
> >> > you want out of the file system, it's best that you explicitly
> >> > identify what you want, and so we can minimize the cost of the
> >> > features of what you want.
> >>
> >> We are trying to decide whether to use a filesystem or LVM for VM
> >> storage. It's not that we are happy with LVM - while it performs
> >> better, there are limitations on the LVM side, especially when it
> >> comes to manageability (for example, certain features in OpenStack
> >> only work if the VM is file-based).
> >>
> >> So, in short, if we could make the filesystem perform better we would
> >> rather use a filesystem than LVM (and we don't really have any special
> >> requirements in terms of filesystem features).
> >>
> >> And in order for us to make a good decision I wanted to ask the
> >> community whether our observations and resulting numbers make sense.
> >
> > For ext4, this is what you are going to get.
> >
> > How about you try XFS? After all, concurrent direct IO writes are
> > something it is rather good at.
> >
> > i.e. use XFS in both your host and guest. Use raw image files on the
> > host, and to make things roughly even with LVM you'll want to
> > preallocate them. If you don't want to preallocate them (i.e. sparse
> > image files), set them up with an extent size hint of at least 1MB so
> > that it limits fragmentation of the image file. Then configure qemu
> > to use cache=none for its IO to the image file.
> >
> > On the first write pass to the image file (in either case), you
> > should see ~70-80% of the native underlying device performance,
> > because there is some overhead in either allocation (sparse image
> > file) or unwritten extent conversion (preallocated image file).
> > This, of course, assumes you are not CPU limited in the QEMU
> > process by the additional CPU overhead of file block mapping in the
> > host filesystem vs raw block device IO.
> >
> > On the second write pass you should see 98-99% of the native
> > underlying device performance (again with the assumption that CPU
> > overhead of the host filesystem isn't a limiting factor).
> >
> > As an example, I have a block device that can sustain just under 36k
> > random 4k write IOPS on my host. I have an XFS filesystem (default
> > configs) on that 400GB block device. I created a sparse 500TB image
> > file using:
> >
> > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
> >
> > and pushed it into a 16p/16GB RAM guest via:
> >
> > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
> >
> > and in the guest ran mkfs.xfs with defaults and mounted it with
> > defaults. Then I ran your fio test on that 5 times in a row:
> >
> > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
> >
> > The first run was 26k IOPS, the rest were at 35k IOPS as they
> > overwrite the same blocks in the image file. IOWs, the first pass ran
> > at 75% of device capability, the rest at >98% of the host-measured
> > device capability. All tests reported that the full IO depth was being
> > used in the guest:
> >
> > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >
> > The guest OS measured about 30% CPU usage for a single fio run at
> > 35k IOPS:
> >
> > real    0m22.648s
> > user    0m1.678s
> > sys     0m8.175s
> >
> > However, the QEMU process on the host required 4 entire CPUs to
> > sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> > amount of the CPU overhead on such workloads is on the host side in
> > QEMU, not in the guest.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html