All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Schmidt <mail@jan-o-sch.net>
To: Mark Fasheh <mfasheh@suse.de>
Cc: linux-btrfs@vger.kernel.org, Chris Mason <chris.mason@oracle.com>,
	Josef Bacik <josef@redhat.com>
Subject: Re: [PATCH 1/3] btrfs: extended inode refs
Date: Wed, 25 Apr 2012 12:19:21 +0200	[thread overview]
Message-ID: <4F97CFA9.9010200@jan-o-sch.net> (raw)
In-Reply-To: <20120424222356.GS17950@wotan.suse.de>

On 25.04.2012 00:23, Mark Fasheh wrote:
> Jan, firstly thanks very much for the thorough review! I wanted to make sure
> I got the collision handling going before addressing the issues you found.
> Again thanks, and my notes are below.



> On Thu, Apr 12, 2012 at 03:08:27PM +0200, Jan Schmidt wrote:
>>> +/*
>>> + * Theoretical limit is larger, but we keep this down to a sane
>>> + * value. That should limit greatly the possibility of collisions on
>>> + * inode ref items.
>>> + */
>>> +#define BTRFS_LINK_MAX 65535U
>>
>> Do we really want an artificial limit like that?
> 
> I took a look at other file systems and this seems to be a pretty reasonable
> maximum. Ext4 and XFS in particular have similar limits.
> 
> Granted, btrfs should be better equipped to deal with large numbers of hard
> links - at least in the area of consistency checking.
> 
> That said, I think keeping this down to "only" 65 thousand hard links per
> inode should work out fine. Also it's not a large change to increase or
> remove the limit should we have to.

Agreed.

>>> -struct btrfs_inode_ref *
>>> +static int find_name_in_ext_backref(struct btrfs_path *path, const char *name,
>>> +				   int name_len, struct btrfs_inode_extref **ref_ret)
>>> +{
>>> +	struct extent_buffer *leaf;
>>> +	struct btrfs_inode_extref *ref;
>>> +	unsigned long ptr;
>>> +	unsigned long name_ptr;
>>> +	u32 item_size;
>>> +
>>> +	leaf = path->nodes[0];
>>> +	item_size = btrfs_item_size_nr(leaf, path->slots[0]);
>>> +	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>>> +	ref = (struct btrfs_inode_extref *)ptr;
>>> +
>>> +	/*
>>> +	 * There isn't actually anything to "search" in an extended
>>> +	 * backref - one name is directly attached to each
>>> +	 * item. Instead what we're really doing here is some sort of
>>> +	 * sanity check - does the name exist where it should have
>>> +	 * been found. If all is well, we'll return success and the
>>> +	 * inode ref object.
>>> +	 */
>>
>> In that case, the name is not a good choice. However, if we change how
>> we deal with hash collisions, the name might be right and the comment
>> might go away as we really look through more than one item. If we leave
>> it like this, please rename that function.
> 
> Yeah, great catch. In the newer version as you've noted, it *is* actually a
> 'find me the name' function, so the comment (and resulting EROFS) have been
> removed.
> 
> 
>>> +	name_ptr = (unsigned long)(&ref->name);
>>> +	if (btrfs_inode_extref_name_len(leaf, ref) == name_len
>>> +	    && (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
>>> +		*ref_ret = ref;
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * Figure the key offset of an extended inode ref
>>> + */
>>> +u64 btrfs_extref_key_off(u64 parent_objectid, const char *name, int name_len)
>>
>> I'd suggest to call this one btrfs_extref_hash and move it to
>> fs/btrfs/hash.h. That way, we can also drop the include for
>> <linux/crc32c.h> above.
> 
> Done.
> 
> 
>>> +{
>>> +	return (u64) crc32c(parent_objectid, name, name_len);
>>> +}
>>> +
>>> +static struct btrfs_inode_ref *
>>>  btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
>>> -			struct btrfs_root *root,
>>> -			struct btrfs_path *path,
>>> -			const char *name, int name_len,
>>> -			u64 inode_objectid, u64 ref_objectid, int mod)
>>> +		       struct btrfs_root *root,
>>> +		       struct btrfs_path *path,
>>> +		       const char *name, int name_len,
>>> +		       u64 inode_objectid, u64 ref_objectid, int ins_len,
>>> +		       int cow)
>>>  {
>>> +	int ret;
>>>  	struct btrfs_key key;
>>>  	struct btrfs_inode_ref *ref;
>>> -	int ins_len = mod < 0 ? -1 : 0;
>>> -	int cow = mod != 0;
>>> -	int ret;
>>>  
>>>  	key.objectid = inode_objectid;
>>>  	key.type = BTRFS_INODE_REF_KEY;
>>> @@ -76,20 +115,137 @@ btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
>>>  	return ref;
>>>  }
>>>  
>>> -int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>> +struct btrfs_inode_extref *
>>> +btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
>>> +			  struct btrfs_root *root,
>>> +			  struct btrfs_path *path,
>>> +			  const char *name, int name_len,
>>> +			  u64 inode_objectid, u64 ref_objectid, int ins_len,
>>> +			  int cow)
>>> +{
>>> +	int ret;
>>> +	struct btrfs_key key;
>>> +	struct btrfs_inode_extref *ref;
>>                                    ^^^
>> Please use some other identifier than the one we use for struct
>> btrfs_inode_ref. Suggestion: extref or eref.
> 
> Yeah I'll make sure they're all changed to extref. Makes it much easier to
> read.
> 
> 
>>> +
>>> +	key.objectid = inode_objectid;
>>> +	key.type = BTRFS_INODE_EXTREF_KEY;
>>> +	key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>                      ^^^^^^^^^^^^^^^^^^^^
>> Get's clearer when that is called btrfs_extref_hash.
>>
>>> +
>>> +	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
>>> +	if (ret < 0)
>>> +		return ERR_PTR(ret);
>>> +	if (ret > 0)
>>> +		return NULL;
>>> +	if (!find_name_in_ext_backref(path, name, name_len, &ref))
>>> +		return NULL;
>>> +	return ref;
>>> +}
>>> +
>>> +int btrfs_get_inode_ref_index(struct btrfs_trans_handle *trans,
>>> +			      struct btrfs_root *root,
>>> +			      struct btrfs_path *path,
>>> +			      const char *name, int name_len,
>>> +			      u64 inode_objectid, u64 ref_objectid, int mod,
>>> +			      u64 *ret_index)
>>> +{
>>> +	struct btrfs_inode_ref *ref1;
>>> +	struct btrfs_inode_extref *ref2;
>>                                    ^^^^
>> Please, use "ref" and ("eref" or "extref").
>>
>>> +	int ins_len = mod < 0 ? -1 : 0;
>>> +	int cow = mod != 0;
>>> +
>>> +	ref1 = btrfs_lookup_inode_ref(trans, root, path, name, name_len,
>>> +				      inode_objectid, ref_objectid, ins_len,
>>> +				      cow);
>>> +	if (IS_ERR(ref1))
>>> +		return PTR_ERR(ref1);
>>> +
>>> +	if (ref1 != NULL) {
>>> +		*ret_index = btrfs_inode_ref_index(path->nodes[0], ref1);
>>> +		return 0;
>>> +	}
>>> +
>>> +	btrfs_release_path(path);
>>> +
>>> +	ref2 = btrfs_lookup_inode_extref(trans, root, path, name,
>>> +					 name_len, inode_objectid,
>>> +					 ref_objectid, ins_len, cow);
>>> +	if (IS_ERR(ref2))
>>> +		return PTR_ERR(ref2);
>>> +
>>> +	if (ref2) {
>>> +		*ret_index = btrfs_inode_extref_index(path->nodes[0], ref2);
>>> +		return 0;
>>> +	}
>>> +
>>> +	return -ENOENT;
>>> +}
>>> +
>>> +int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>>>  			   struct btrfs_root *root,
>>>  			   const char *name, int name_len,
>>>  			   u64 inode_objectid, u64 ref_objectid, u64 *index)
>>>  {
>>>  	struct btrfs_path *path;
>>>  	struct btrfs_key key;
>>> +	struct btrfs_inode_extref *ref2;
>>                                    ^^^^
>>
>>> +	struct extent_buffer *leaf;
>>> +	int ret;
>>> +
>>> +	key.objectid = inode_objectid;
>>> +	btrfs_set_key_type(&key, BTRFS_INODE_EXTREF_KEY);
>>> +	key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	path->leave_spinning = 1;
>>> +
>>> +	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>>> +	if (ret > 0) {
>>> +		ret = -ENOENT;
>>> +		goto out;
>>> +	} else if (ret < 0) {
>>> +		goto out;
>>> +	}
>>
>> Clearer (to me):
>>
>> if (ret > 0)
>> 	ret = -ENOENT;
>> if (ret < 0)
>> 	goto out;
> 
> Agreed, changed.
> 
>>
>>> +
>>> +	/*
>>> +	 * Sanity check - did we find the right item for this name?
>>> +	 * This should always succeed so error here will make the FS
>>> +	 * readonly.
>>> +	 */
>>> +	if (!find_name_in_ext_backref(path, name, name_len, &ref2)) {
>>> +		btrfs_std_error(root->fs_info, -ENOENT);
>>> +		ret = -EROFS;
>>
>> Simply returning -EROFS doesn't make the fs read-only, does it? I'd
>> suggest to return -EIO and drop the comment above.
> 
> Actually, the btrfs_std_error() line above that makes the FS readonly (in
> which case EROFS is a good return value).

Got it. -EROFS is a good choice, then.

> Btw, I think that since is this a request to delete the item that not
> finding it in the tree (even with the possiblity of collision handling) is
> still a possible corruption.
> 
> 
>>> +		goto out;
>>> +	}
>>> +
>>> +	leaf = path->nodes[0];
>>> +	if (index)
>>> +		*index = btrfs_inode_extref_index(leaf, ref2);
>>> +
>>> +	ret = btrfs_del_item(trans, root, path);
>>> +
>>> +out:
>>> +	btrfs_free_path(path);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>> +			struct btrfs_root *root,
>>> +			const char *name, int name_len,
>>> +			u64 inode_objectid, u64 ref_objectid, u64 *index)
>>> +{
>>> +	struct btrfs_path *path;
>>> +	struct btrfs_key key;
>>>  	struct btrfs_inode_ref *ref;
>>>  	struct extent_buffer *leaf;
>>>  	unsigned long ptr;
>>>  	unsigned long item_start;
>>>  	u32 item_size;
>>>  	u32 sub_item_len;
>>> -	int ret;
>>> +	int ret, search_ext_refs = 0;
>>                ^^^^^^^^^^^^^^^^^^^^^
>> Documentation/CodingStyle:
>> --
>> To this end, use just one data declaration per line (no commas for
>> multiple data declarations).
>> --
>>
>> You do this multiple times in your whole patch set.
> 
> Sure, I can change that. There's lots of kernel code that blatantly ignores
> this but I suppose that's no excuse...
> 
> 
>>>  	int del_len = name_len + sizeof(*ref);
>>>  
>>>  	key.objectid = inode_objectid;
>>> @@ -105,12 +261,14 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>>  	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>>>  	if (ret > 0) {
>>>  		ret = -ENOENT;
>>> +		search_ext_refs = 1;
>>>  		goto out;
>>>  	} else if (ret < 0) {
>>>  		goto out;
>>>  	}
>>>  	if (!find_name_in_backref(path, name, name_len, &ref)) {
>>>  		ret = -ENOENT;
>>> +		search_ext_refs = 1;
>>>  		goto out;
>>>  	}
>>>  	leaf = path->nodes[0];
>>> @@ -132,6 +290,59 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>>>  				  item_size - sub_item_len, 1);
>>>  out:
>>>  	btrfs_free_path(path);
>>> +
>>> +	if (search_ext_refs) {
>>
>> As far as I see it, you can drop search_ext_refs and simply go this way
>> on ret == -ENOENT.
> 
> I actually made it explicit on purpose. If you look at the function there's
> several places where ret can be set and then bubbled down. Instead of
> depending on the called functions never setting -ENOENT (and future
> programmers to get that it's a 'magic' value), I prefer to just explicitely
> set a single int variable.

Okay, sounds reasonable.

>>> +		/* 
>>                   ^
>> Catched by accident: trailing whitespace.
> 
> Oh thanks.
> 
>>
>>> +		 * No refs were found, or we could not find the
>>> +		 * name in our ref array. Find and remove the extended
>>> +		 * inode ref then.
>>> +		 */
>>> +		return btrfs_del_inode_extref(trans, root, name, name_len,
>>> +					      inode_objectid, ref_objectid, index);
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>>> +				     struct btrfs_root *root,
>>> +				     const char *name, int name_len,
>>> +				     u64 inode_objectid, u64 ref_objectid, u64 index)
>>
>> Are we missing a check against BTRFS_LINK_MAX in this function?
> 
> No, the caller is supposed to check this, in theory before starting a
> transaction to change the FS. So btrfs_link() does the checking for us.

Is that worth a comment? Or is that clear to anybody familiar with such
things anyway (in which case I'd not add that comment)?

>>> +{
>>> +	struct btrfs_path *path;
>>> +	struct btrfs_key key;
>>> +	struct btrfs_inode_extref *ref;
>> extref
>>
>>> +	unsigned long ptr;
>>> +	int ret;
>>> +	int ins_len = name_len + sizeof(*ref);
>>> +
>>> +	key.objectid = inode_objectid;
>>> +	key.type = BTRFS_INODE_EXTREF_KEY;
>>> +	key.offset = btrfs_extref_key_off(ref_objectid, name, name_len);
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	path->leave_spinning = 1;
>>> +	ret = btrfs_insert_empty_item(trans, root, path, &key,
>>> +				      ins_len);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
>>> +			     struct btrfs_inode_extref);
>>> +
>>> +	btrfs_set_inode_extref_name_len(path->nodes[0], ref, name_len);
>>> +	btrfs_set_inode_extref_index(path->nodes[0], ref, index);
>>> +	btrfs_set_inode_extref_parent(path->nodes[0], ref, ref_objectid);
>>> +
>>> +	ptr = (unsigned long)&ref->name;
>>> +	write_extent_buffer(path->nodes[0], name, ptr, name_len);
>>> +	btrfs_mark_buffer_dirty(path->nodes[0]);
>>> +
>>> +out:
>>> +	btrfs_free_path(path);
>>>  	return ret;
>>>  }
>>>  
>>> @@ -189,6 +400,19 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>>>  
>>>  out:
>>>  	btrfs_free_path(path);
>>> +
>>> +	if (ret == -EMLINK) {
>>> +		struct btrfs_super_block *disk_super = root->fs_info->super_copy;
>>> +		/* We ran out of space in the ref array. Need to
>>> +		 * add an extended ref. */
>>> +		if (btrfs_super_incompat_flags(disk_super)
>>> +		    & BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
>>> +			ret = btrfs_insert_inode_extref(trans, root, name,
>>> +							name_len,
>>> +							inode_objectid,
>>> +							ref_objectid, index);
>>> +	}
>>> +
>>>  	return ret;
>>>  }
>>>  
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 892b347..0ca525d 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -2689,7 +2689,6 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
>>>  	struct btrfs_trans_handle *trans;
>>>  	struct btrfs_root *root = BTRFS_I(dir)->root;
>>>  	struct btrfs_path *path;
>>> -	struct btrfs_inode_ref *ref;
>>>  	struct btrfs_dir_item *di;
>>>  	struct inode *inode = dentry->d_inode;
>>>  	u64 index;
>>> @@ -2803,17 +2802,16 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
>>>  	}
>>>  	btrfs_release_path(path);
>>>  
>>> -	ref = btrfs_lookup_inode_ref(trans, root, path,
>>> -				dentry->d_name.name, dentry->d_name.len,
>>> -				ino, dir_ino, 0);
>>> -	if (IS_ERR(ref)) {
>>> -		err = PTR_ERR(ref);
>>> +	ret = btrfs_get_inode_ref_index(trans, root, path, dentry->d_name.name,
>>> +					dentry->d_name.len, ino, dir_ino, 0, &index);
>>
>> This line is >80 chars. Please use scripts/checkpatch.pl to catch those.
> 
> Fixed.
> 
> 
>>> +	if (ret) {
>>> +		err = ret;
>>>  		goto out;
>>>  	}
>>> -	BUG_ON(!ref);
>>> +
>>>  	if (check_path_shared(root, path))
>>>  		goto out;
>>> -	index = btrfs_inode_ref_index(path->nodes[0], ref);
>>> +
>>>  	btrfs_release_path(path);
>>>  
>>>  	/*
>>> @@ -4484,6 +4482,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>>>  	btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY);
>>>  	key[0].offset = 0;
>>>  
>>> +	/*
>>> +	 * Start new inodes with a V1 ref. This is slightly more
>>> +	 * efficient for small numbers of hard links.
>>> +	 */
>>
>> I'd drop that comment. extref is no "V2" kind of ref. But you can leave
>> it in if you feel it helps anyone later. 
> 
> Yeah at one point they were called 'v2' due to developer laziness :)
> 
> I've updated the comment:
> 
> 	/*
> 	 * Start new inodes with an inode_ref. This is slightly more
> 	 * efficient for small numbers of hard links since they will
> 	 * be packed into one item. Extended refs will kick in if we
> 	 * add more hard links than can fit in the ref item.
> 	 */
> 
> 
> It could probably still be improved beyond that. Mostly I want to point out
> that we never start with extended refs so that anyone looking at the code
> will understand the 'flow' of btrfs inode references.

I think its fine like it stands above.

-Jan

>>>  	key[1].objectid = objectid;
>>>  	btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY);
>>>  	key[1].offset = ref_objectid;
>>> @@ -4777,7 +4779,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
>>>  	if (root->objectid != BTRFS_I(inode)->root->objectid)
>>>  		return -EXDEV;
>>>  
>>> -	if (inode->i_nlink == ~0U)
>>> +	if (inode->i_nlink >= BTRFS_LINK_MAX)
>>>  		return -EMLINK;
>>>  
>>>  	err = btrfs_set_inode_index(dir, &index);
>>
>> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
> 
> Thanks again Jan. I'll be sure to CC you on my next drop!
> 	--Mark
> 
> --
> Mark Fasheh

  reply	other threads:[~2012-04-25 10:19 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-05 20:09 [PATCH 0/3] " Mark Fasheh
2012-04-05 20:09 ` [PATCH 1/3] " Mark Fasheh
2012-04-12 13:08   ` Jan Schmidt
2012-04-24 22:23     ` Mark Fasheh
2012-04-25 10:19       ` Jan Schmidt [this message]
2012-04-05 20:09 ` [PATCH 2/3] " Mark Fasheh
2012-04-12 13:08   ` Jan Schmidt
2012-05-03 23:12     ` Mark Fasheh
2012-05-04 11:39       ` David Sterba
2012-04-12 15:53   ` Jan Schmidt
2012-05-01 18:39     ` Mark Fasheh
2012-04-05 20:09 ` [PATCH 3/3] " Mark Fasheh
2012-04-12 17:59   ` Jan Schmidt
2012-04-12 18:38     ` Jan Schmidt
2012-05-08 22:57     ` Mark Fasheh
2012-05-09 17:02       ` Chris Mason
2012-05-10  8:23         ` Jan Schmidt
2012-05-10 13:35           ` Chris Mason
2012-04-05 21:13 ` [PATCH 0/3] " Jeff Mahoney
2012-04-11 13:11   ` Jan Schmidt
2012-04-11 13:29     ` Jan Schmidt
2012-04-12 16:11     ` Chris Mason
2012-04-12 16:19       ` Mark Fasheh
2012-04-06  1:24 ` Liu Bo
2012-04-06  2:12   ` Liu Bo
2012-05-21 21:46 Mark Fasheh
2012-05-21 21:46 ` [PATCH 1/3] " Mark Fasheh
2012-07-06 14:56   ` Jan Schmidt
2012-07-06 15:14     ` Stefan Behrens
2012-07-09 19:05     ` Mark Fasheh
2012-07-09 20:33     ` Mark Fasheh
2012-08-08 18:55 [PATCH 0/3] " Mark Fasheh
2012-08-08 18:55 ` [PATCH 1/3] " Mark Fasheh
2012-08-14  9:32   ` Jan Schmidt
2012-08-14 16:51     ` Mark Fasheh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F97CFA9.9010200@jan-o-sch.net \
    --to=mail@jan-o-sch.net \
    --cc=chris.mason@oracle.com \
    --cc=josef@redhat.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=mfasheh@suse.de \
    --subject='Re: [PATCH 1/3] btrfs: extended inode refs' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.