From: Vladimir Bashkirtsev
Subject: Re: Poor read performance in KVM
Date: Sat, 21 Jul 2012 02:23:25 +0930
Message-ID: <50098D05.2080704@bashkirtsev.com>
To: Tommi Virtanen
Cc: Josh Durgin, ceph-devel

On 21/07/2012 2:12 AM, Tommi Virtanen wrote:
> On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
> wrote:
>> not running. So I ended up rebooting the hosts and that's where the fun began:
>> btrfs failed to umount, and on boot up it spat out "btrfs: free space inode
>> generation (0) did not match free space cache generation (177431)". I had
>> not started ceph yet and made an attempt to umount, and umount just froze.
>> Another reboot: same stuff. I rebooted the second host and it came back
>> with the same error. So in effect I was unable to mount btrfs and read it:
>> no wonder that ceph was unable to run. Actually, according to the mons, ceph was
> The btrfs developers tend to be good about bug reports that severe --
> I think you should email that mailing list and ask if that sounds like a
> known bug, and ask what information you should capture if it happens
> again (assuming the workload is complex enough that you can't easily
> capture/reproduce all of that).

Well... The workload was fairly high - not something that usually happens
on MySQL. Our client keeps imagery in MySQL and his system was regenerating
images (it takes a hi-res image and produces five or six smaller images plus
a watermark). The job runs imagemagick, which keeps its temporary data on
disk (and to ceph it is not really temporary data - it is data which must be
committed to the osds), and then innodb in MySQL stores the results - which
of course creates a number of pages and so appears as random writes to the
underlying file system. From what I have seen, the write traffic created by
this process was in the TB range (my whole ceph cluster is just 3.3TB). So
it was a considerable amount of change on the filesystem. I guess if we
start that process again we will end up with a similar result in a few days
- but for some reason I don't want to try it on a production system :) I can
scavenge something from the logs and post it to the btrfs devs. Thanks for
the tip.

>
>> But it leaves me with one final question: should we rely on btrfs at this
>> point given it has such major faults? What if I use ext4, which is well
>> tested by time?
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.

That's what I was thinking: before xattrs-in-leveldb I did not even consider
ext4 as a viable alternative, but now it may be reasonable to give it a go.
Or maybe even have a mix of osds backed by different file systems? What is
the devs' opinion on this?
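
To be concrete, what I have in mind is something along these lines in
ceph.conf - just a rough sketch, and the option names ("osd mkfs type",
"osd mount options xfs") and paths are from my memory of the mkcephfs-style
config, so please check them against the docs before taking this literally:

[osd]
    osd data = /var/lib/ceph/osd/ceph-$id
    osd journal = /var/lib/ceph/osd/ceph-$id/journal

[osd.0]
    host = host1
    ; keep this osd on btrfs
    osd mkfs type = btrfs

[osd.1]
    host = host2
    ; format this one as xfs instead
    osd mkfs type = xfs
    osd mount options xfs = rw,noatime,inode64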