From mboxrd@z Thu Jan 1 00:00:00 1970
From: Josef Bacik
Subject: Re: Delayed inode operations not doing the right thing with enospc
Date: Thu, 14 Jul 2011 11:53:18 -0400
Message-ID: <4E1F10EE.6050706@redhat.com>
References: <4DE92BF2.1060905@redhat.com> <4DED8143.3090803@cn.fujitsu.com> <4DEE9263.1000802@redhat.com> <4E1DB22F.1060405@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: miaox@cn.fujitsu.com, linux-btrfs , ceph-devel@vger.kernel.org
To: chb@muc.de
Return-path:
In-Reply-To:
List-ID:

On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete. This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked that the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?
>>>>>> Thanks,
>>>>>
>>>>> Sorry, it's my miss.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204. I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there. Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow. Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down. Will you run with this patch and
>> make sure the warnings go away? Thanks,
>
> I'm sorry, I'm still getting a lot of warnings like the one below.
>
> I've also noticed, that I'm not getting these messages when the
> free_space_cache is disabled.
>

Ok I see what's wrong, our checksum calculation is completely bogus.
I'm in the middle of something big so I can't give you a nice clean
patch, but if you go into extent-tree.c and replace
calc_csum_metadata_size with this you should be good to go:

static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	int num_leaves;
	int num_csums;
	u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);

	/* one checksum per sector of data */
	num_csums = (int)div64_u64(num_bytes, root->sectorsize);
	/* whole leaves taken up by those checksums */
	num_leaves = (int)((num_csums * csum_size) / root->leafsize);

	return btrfs_calc_trans_metadata_size(root, num_leaves);
}

Thanks,

Josef
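
For anyone following along, here is a rough userspace sketch of the leaf
estimate that the replacement function computes. The helper name
csum_leaves() is hypothetical, and the sizes used in the note below
(4096-byte sectors, 4096-byte leaves, 4-byte checksums) are assumptions
for illustration, not values taken from the thread. The final call to
btrfs_calc_trans_metadata_size() is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace mirror of the leaf estimate in the replacement
 * calc_csum_metadata_size(). csum_leaves() is a hypothetical helper
 * name; the kernel function passes this result on to
 * btrfs_calc_trans_metadata_size(), which is left out here.
 */
static int csum_leaves(uint64_t num_bytes, uint32_t sectorsize,
		       uint32_t leafsize, uint16_t csum_size)
{
	int num_csums;

	/* Guard the two divisions below. */
	assert(sectorsize != 0 && leafsize != 0);

	/* One checksum per sector of data. */
	num_csums = (int)(num_bytes / sectorsize);

	/* Whole leaves filled by those checksums; the division rounds down. */
	return (int)((num_csums * csum_size) / leafsize);
}
```

With those assumed sizes, a 16 MiB write needs 4096 checksums (16384
bytes), which comes out to 4 leaves. A 1 MiB write needs 256 checksums
(1024 bytes), which comes out to 0 leaves because the integer division
rounds down.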