All of
 help / color / mirror / Atom feed
From: Mark Fasheh <>
Cc: Chris Mason <>, Josef Bacik <>
Subject: [PATCH 0/3] btrfs: extended inode refs
Date: Thu,  5 Apr 2012 13:09:00 -0700	[thread overview]
Message-ID: <> (raw)

Currently btrfs has a limitation on the maximum number of hard links an
inode can have. Specifically, links are stored in an array of ref

struct btrfs_inode_ref {
	__le64 index;
	__le16 name_len;
	/* name goes here */
} __attribute__ ((__packed__));

The ref arrays are found via key triple:

(inode objectid, BTRFS_INODE_EXTREF_KEY, parent dir objectid)

Since items can not exceed the size of a leaf, the total number of links
that can be stored for a given inode / parent dir pair is limited to under
4k. This works fine for the most common case of few to only a handful of
links. Once the link count gets higher however, we begin to return EMLINK.

The following patches fix this situation by introducing a new ref item:

struct btrfs_inode_extref {
	__le64 parent_objectid;
	__le64 index;
	__le16 name_len;
	__u8   name[0];
	/* name goes here */
} __attribute__ ((__packed__));

Extended refs behave differently from ref arrays in several key areas.

Each extended refs is it's own item so there is no ref array (and
therefore no limit on size).

As a result, we must use a different addressing scheme. Extended ref keys
look like:

(inode objectid, BTRFS_INODE_EXTREF_KEY, hash)

Where hash is defined as a function of the parent objectid and link name.

This effectively fixes the limitation, though we have a slightly less
efficient packing of link data. To keep the best of both worlds then, I
implemented the following behavior:

Extended refs don't replace the existing ref array. An inode gets an
extended ref for a given link _only_ after the ref array has been filled.  So
the most common cases shouldn't actually see any difference in performance
or disk usage as they'll never get to the point where we're using an
extended ref.

It's important while reading the patches however that there's still the
possibility that we can have a set of operations that grow out an inode ref
array (adding some extended refs) and then remove only the refs in the
array.  I don't really see this being common but it's a case we always have
to consider when coding these changes.

Right now there is a limitation for extrefs in that we're not handling the
possibility of a hash collision. There are two ways I see we can deal with

We can use a 56-bit hash and keep a generation counter in the lower 8
bits of the offset field.  The cost would be an additional tree search
(between offset <hash>00 and <hash>FF) if we don't find exactly the name we
were looking for.

An alternative solution to dealing with collisions could be to emulate the
dir-item insertion code - specifically something like insert_with_overflow()
which will stuff multiple items under one key. I tend to prefer the idea of
simply including a generation in the key offset however since it maintains
the 1:1 relationship of keys to names which turns out to be much nicer to
code for in my honest opinion. Also none of the code which iterates the tree
looking for refs would have to change as the only difference is in the key
offset and not in the actual item structure.

Testing wise, the patches are in an intermediate state. I've debugged a fair
bit but I'm certain there's gremlins lurking in there.  The basic namespace
operations work well enough (link, unlink, etc).  I've done light testing of
my changes in backref.c by exercising BTRFS_IOC_INO_PATHS.  The changes in
tree-log.c need the most review and testing - I haven't really figured out a
great way to exercise the code in tree-log yet (suggestions would be

Finally, these patches are based off Linux v3.3.

             reply	other threads:[~2012-04-05 20:09 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-05 20:09 Mark Fasheh [this message]
2012-04-05 20:09 ` [PATCH 1/3] " Mark Fasheh
2012-04-12 13:08   ` Jan Schmidt
2012-04-24 22:23     ` Mark Fasheh
2012-04-25 10:19       ` Jan Schmidt
2012-04-05 20:09 ` [PATCH 2/3] " Mark Fasheh
2012-04-12 13:08   ` Jan Schmidt
2012-05-03 23:12     ` Mark Fasheh
2012-05-04 11:39       ` David Sterba
2012-04-12 15:53   ` Jan Schmidt
2012-05-01 18:39     ` Mark Fasheh
2012-04-05 20:09 ` [PATCH 3/3] " Mark Fasheh
2012-04-12 17:59   ` Jan Schmidt
2012-04-12 18:38     ` Jan Schmidt
2012-05-08 22:57     ` Mark Fasheh
2012-05-09 17:02       ` Chris Mason
2012-05-10  8:23         ` Jan Schmidt
2012-05-10 13:35           ` Chris Mason
2012-04-05 21:13 ` [PATCH 0/3] " Jeff Mahoney
2012-04-11 13:11   ` Jan Schmidt
2012-04-11 13:29     ` Jan Schmidt
2012-04-12 16:11     ` Chris Mason
2012-04-12 16:19       ` Mark Fasheh
2012-04-06  1:24 ` Liu Bo
2012-04-06  2:12   ` Liu Bo
2012-05-21 21:46 Mark Fasheh
2012-08-08 18:55 Mark Fasheh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \
    --subject='Re: [PATCH 0/3] btrfs: extended inode refs' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.