* [RFC PATCH 0/7] Experimental btrfs send/receive (kernel side)
@ 2012-07-04 13:38 Alexander Block
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

Hello all,

This patchset introduces the btrfs send ioctl, which creates a stream
of instructions that can later be replayed to reconstruct the sent
subvolumes/snapshots. Patches for btrfs-progs will follow in a separate
patchset.

Some of you may remember the previous discussions on send/receive. The
original plan was to use ustar/pax as container for the stream, which
was a good format at the beginning as we planned to store extents and
other data as if they were normal files so that btrfs receive could
unpack them correctly to the right places. The advantage was that you
could unpack it with tar and use the contents by hand to some degree.

The stream has, however, changed into a kind of instruction stream, as
this was the easiest way to handle moves, deletes and overwrites
correctly. If this stream were stored in ustar/pax format, we would
have no advantage over a custom stream format, so I dropped ustar/pax
in the middle of development. I may later add a new mode for the ioctl
(or a new ioctl?) that emits a plain diff of the parent root and the
root to send, instead of instructions.
This could then be used to do something like what was planned at the
beginning. It could also have other uses too. But that's for later.

The stream now consists of millions of create/rename/link/write/clone/
chmod/... instructions which only need to be replayed. No kernel
support is required to replay the stream. The only exception is the
BTRFS_IOC_SET_RECEIVED_SUBVOL call that is performed when btrfs 
receive is done.

btrfs send/receive currently only works on read-only snapshots. I have
some ideas floating around for making it possible to send r/w
subvolumes too, but that is for later.

We support full and incremental sending of subvolumes/snapshots.
The ioctl expects an optional list of "clone sources" and an optional
"parent root". The clone sources tell the kernel which subvolumes may
be used as sources for clone operations when processing file extents. The
parent root tells the kernel which root should be used for the
incremental send. Internally, it does a tree compare between the
send root and the parent root to find the differences. If no parent
is specified, the full tree is sent. The parent root is implicitly
added to the clone sources by btrfs-progs. The parent root is also
used for the initial snapshot operation on the receiving side. If no
parent was specified to btrfs-progs, it will try to find a good one in
the list of clone sources. This will however only work for snapshots
that were created with this patchset applied (due to the uuid+times
patch). Older snapshots lack parent information, so you'll need to
specify a parent by hand.

If you used reflinks or the experimental dedup (found on the list)
before, you will need working cross subvolume reflinks on the
receiving side. The send ioctl tries hard to avoid emitting cross
subvolume reflinks if that is possible, but there is no guarantee
for this. If you specify clone sources by hand, there is also a
high chance that cross subvolume clones are emitted. In general,
I tend to see cross subvolume reflinks as a requirement for btrfs
send/receive.

*WARNING* *WARNING* *WARNING* *WARNING*
btrfs send/receive is experimental. The main usage for send/receive
in the future will probably be backups. If you use it for backups,
you're taking big risks and may end up with unusable backups. Please
do not only count on btrfs send/receive backups!

If you still want to use it, make sure the backups are working and
100% correct. I for example used rsync in dry-run mode to verify that
a stream was received correctly: simply receive the just-sent
subvolume and compare it against the origin. Here is the command line
that I used for it:

rsync -aAXvnc --delete /origin/subvol/ /backup-target/subvol/

The -c flag is the most important here; don't remove it just to make
rsync faster. btrfs receive restores the file times 1:1, so rsync
may consider differing files equal when it doesn't compare by
checksum. If rsync ever prints a file or directory in its output,
you have found a bug in btrfs send/receive. Please report it.

Also, the output format of btrfs send may not be final. I'll try
hard not to change it too much and to keep compatibility, but as this
is a very early version, I can't guarantee anything. So please don't
store the send streams with the assumption that you can still receive
them in a year.

You've been warned...

*END OF WARNING*

Big thanks go to Arne Jansen, David Sterba and Jan Schmidt (sorted by
first name), who helped me a lot with their assistance on IRC and
their reviews. The code however still needs a lot of review and
testing, so feel welcome to do so :)

You can pick and apply the patches by hand if you want. Don't
forget to also apply the required patches mentioned below. As an
alternative, here is my git repo containing all required patches:

git://github.com/ablock84/linux-btrfs.git (branch send)

The branch is based on 3.5-rc5. I had to split the last patch/commit
as it exceeded 100k, which could be a problem on the mailing list.
My plan for the branch is to do fixes in separate commits. I won't
send new patches to the list until I have the feeling it's worth a
full new vX patchset. In case we get a new version of the patchset, I
will either rebase the send branch or create a new branch for the new
version with all fixes squashed into the original commits. When btrfs
send is merged into mainline, development will continue as with all
other btrfs stuff. If you have the feeling that this is the wrong
approach, please tell me, as this is the first project where I
actively use git and work in such a big community.

Requirements for this patchset to work properly:
1. At least kernel 3.5-rc5
2. The "Btrfs: don't update atime on RO subvolumes" patch.
   Found in btrfs-next and my repo.
3. Working cross subvolume reflinks. A patch from David Sterba 
   is found in his and my repo.
4. The patch "Btrfs: add helper for tree enumeration" from
   Arne. Found in my repo.
5. The patch "Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS".
   Found in my repo and btrfs-next.
6. All the patches found in this patchset.

Alex.

Alexander Block (6):
  Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
  Btrfs: make iref_to_path non static
  Btrfs: introduce subvol uuids and times
  Btrfs: add btrfs_compare_trees function
  Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1)
  Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)

Arne Jansen (1):
  Btrfs: add helper for tree enumeration

 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/backref.c     |   10 +-
 fs/btrfs/backref.h     |    4 +
 fs/btrfs/ctree.c       |  499 ++++++
 fs/btrfs/ctree.h       |   61 +
 fs/btrfs/disk-io.c     |    2 +
 fs/btrfs/inode.c       |    4 +
 fs/btrfs/ioctl.c       |   99 +-
 fs/btrfs/ioctl.h       |   25 +-
 fs/btrfs/root-tree.c   |   92 +-
 fs/btrfs/send.c        | 4255 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/send.h        |  130 ++
 fs/btrfs/transaction.c |   17 +
 13 files changed, 5184 insertions(+), 16 deletions(-)
 create mode 100644 fs/btrfs/send.c
 create mode 100644 fs/btrfs/send.h

-- 
1.7.10



* [RFC PATCH 1/7] Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

We used the wrong ioctl macro for the getflags ioctl before.
As we don't have the set/getflags ioctls in the user space ioctl.h
at the moment, it's safe to fix it now.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/ioctl.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 497c530..e440aa6 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -339,7 +339,7 @@ struct btrfs_ioctl_get_dev_stats {
 #define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #define BTRFS_IOC_SNAP_CREATE_V2 _IOW(BTRFS_IOCTL_MAGIC, 23, \
 				   struct btrfs_ioctl_vol_args_v2)
-#define BTRFS_IOC_SUBVOL_GETFLAGS _IOW(BTRFS_IOCTL_MAGIC, 25, __u64)
+#define BTRFS_IOC_SUBVOL_GETFLAGS _IOR(BTRFS_IOCTL_MAGIC, 25, __u64)
 #define BTRFS_IOC_SUBVOL_SETFLAGS _IOW(BTRFS_IOCTL_MAGIC, 26, __u64)
 #define BTRFS_IOC_SCRUB _IOWR(BTRFS_IOCTL_MAGIC, 27, \
 			      struct btrfs_ioctl_scrub_args)
-- 
1.7.10



* [RFC PATCH 2/7] Btrfs: add helper for tree enumeration
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Arne Jansen

From: Arne Jansen <sensille@gmx.net>

Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/ctree.c |   74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ctree.h |    3 +++
 2 files changed, 77 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 15cbc2b..33c8a03 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2724,6 +2724,80 @@ done:
 }
 
 /*
+ * helper to use instead of search slot if no exact match is needed but
+ * instead the next or previous item should be returned.
+ * When find_higher is true, the next higher item is returned, the next lower
+ * otherwise.
+ * When return_any and find_higher are both true, and no higher item is found,
+ * return the next lower instead.
+ * When return_any is true and find_higher is false, and no lower item is found,
+ * return the next higher instead.
+ * It returns 0 if any item is found, 1 if none is found (tree empty), and
+ * < 0 on error
+ */
+int btrfs_search_slot_for_read(struct btrfs_root *root,
+			       struct btrfs_key *key, struct btrfs_path *p,
+			       int find_higher, int return_any)
+{
+	int ret;
+	struct extent_buffer *leaf;
+
+again:
+	ret = btrfs_search_slot(NULL, root, key, p, 0, 0);
+	if (ret <= 0)
+		return ret;
+	/*
+	 * a return value of 1 means the path is at the position where the
+	 * item should be inserted. Normally this is the next bigger item,
+	 * but in case the previous item is the last in a leaf, path points
+	 * to the first free slot in the previous leaf, i.e. at an invalid
+	 * item.
+	 */
+	leaf = p->nodes[0];
+
+	if (find_higher) {
+		if (p->slots[0] >= btrfs_header_nritems(leaf)) {
+			ret = btrfs_next_leaf(root, p);
+			if (ret <= 0)
+				return ret;
+			if (!return_any)
+				return 1;
+			/*
+			 * no higher item found, return the next
+			 * lower instead
+			 */
+			return_any = 0;
+			find_higher = 0;
+			btrfs_release_path(p);
+			goto again;
+		}
+	} else {
+		if (p->slots[0] == 0) {
+			ret = btrfs_prev_leaf(root, p);
+			if (ret < 0)
+				return ret;
+			if (!ret) {
+				p->slots[0] = btrfs_header_nritems(leaf) - 1;
+				return 0;
+			}
+			if (!return_any)
+				return 1;
+			/*
+			 * no lower item found, return the next
+			 * higher instead
+			 */
+			return_any = 0;
+			find_higher = 1;
+			btrfs_release_path(p);
+			goto again;
+		} else {
+			--p->slots[0];
+		}
+	}
+	return 0;
+}
+
+/*
  * adjust the pointers going up the tree, starting at level
  * making sure the right key of each node is points to 'key'.
  * This is used after shifting pointers to the left, so it stops
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fa5c45b..8cfde93 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2711,6 +2711,9 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
 		      ins_len, int cow);
 int btrfs_search_old_slot(struct btrfs_root *root, struct btrfs_key *key,
 			  struct btrfs_path *p, u64 time_seq);
+int btrfs_search_slot_for_read(struct btrfs_root *root,
+			       struct btrfs_key *key, struct btrfs_path *p,
+			       int find_higher, int return_any);
 int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 		       struct btrfs_root *root, struct extent_buffer *parent,
 		       int start_slot, int cache_only, u64 *last_ret,
-- 
1.7.10



* [RFC PATCH 3/7] Btrfs: make iref_to_path non static
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

Make iref_to_path non-static (it is needed by send) and rename
it to btrfs_iref_to_path.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/backref.c |   10 +++++-----
 fs/btrfs/backref.h |    4 ++++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7301cdb..f642d28 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1122,10 +1122,10 @@ static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root,
  * required for the path to fit into the buffer. in that case, the returned
  * value will be smaller than dest. callers must check this!
  */
-static char *iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
-				struct btrfs_inode_ref *iref,
-				struct extent_buffer *eb_in, u64 parent,
-				char *dest, u32 size)
+char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
+			 struct btrfs_inode_ref *iref,
+			 struct extent_buffer *eb_in, u64 parent,
+			 char *dest, u32 size)
 {
 	u32 len;
 	int slot;
@@ -1540,7 +1540,7 @@ static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref,
 					ipath->fspath->bytes_left - s_ptr : 0;
 
 	fspath_min = (char *)ipath->fspath->val + (i + 1) * s_ptr;
-	fspath = iref_to_path(ipath->fs_root, ipath->btrfs_path, iref, eb,
+	fspath = btrfs_iref_to_path(ipath->fs_root, ipath->btrfs_path, iref, eb,
 				inum, fspath_min, bytes_left);
 	if (IS_ERR(fspath))
 		return PTR_ERR(fspath);
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index c18d8ac..1a76579 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -21,6 +21,7 @@
 
 #include "ioctl.h"
 #include "ulist.h"
+#include "extent_io.h"
 
 #define BTRFS_BACKREF_SEARCH_COMMIT_ROOT ((struct btrfs_trans_handle *)0)
 
@@ -60,6 +61,9 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 				struct btrfs_fs_info *fs_info, u64 bytenr,
 				u64 delayed_ref_seq, u64 time_seq,
 				struct ulist **roots);
+char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
+			 struct btrfs_inode_ref *iref, struct extent_buffer *eb,
+			 u64 parent, char *dest, u32 size);
 
 struct btrfs_data_container *init_data_container(u32 total_bytes);
 struct inode_fs_paths *init_ipath(s32 total_bytes, struct btrfs_root *fs_root,
-- 
1.7.10



* [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

This patch introduces uuids for subvolumes. Each
subvolume has its own uuid. If it was snapshotted,
it also contains parent_uuid. If it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes: otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes: stime is
the time the subvolume was sent, and rtime is the time it
was received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extended fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/ctree.h       |   43 ++++++++++++++++++++++
 fs/btrfs/disk-io.c     |    2 +
 fs/btrfs/inode.c       |    4 ++
 fs/btrfs/ioctl.c       |   96 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/ioctl.h       |   13 +++++++
 fs/btrfs/root-tree.c   |   92 +++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/transaction.c |   17 +++++++++
 7 files changed, 258 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8cfde93..2bd5df8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -709,6 +709,35 @@ struct btrfs_root_item {
 	struct btrfs_disk_key drop_progress;
 	u8 drop_level;
 	u8 level;
+
+	/*
+	 * The following fields appear after subvol_uuids+subvol_times
+	 * were introduced.
+	 */
+
+	/*
+	 * This generation number is used to test if the new fields are valid
+	 * and up to date while reading the root item. Everytime the root item
+	 * is written out, the "generation" field is copied into this field. If
+	 * anyone ever mounted the fs with an older kernel, we will have
+	 * mismatching generation values here and thus must invalidate the
+	 * new fields. See btrfs_update_root and btrfs_find_last_root for
+	 * details.
+	 * the offset of generation_v2 is also used as the start for the memset
+	 * when invalidating the fields.
+	 */
+	__le64 generation_v2;
+	u8 uuid[BTRFS_UUID_SIZE];
+	u8 parent_uuid[BTRFS_UUID_SIZE];
+	u8 received_uuid[BTRFS_UUID_SIZE];
+	__le64 ctransid; /* updated when an inode changes */
+	__le64 otransid; /* trans when created */
+	__le64 stransid; /* trans when sent. non-zero for received subvol */
+	__le64 rtransid; /* trans when received. non-zero for received subvol */
+	struct btrfs_timespec ctime;
+	struct btrfs_timespec otime;
+	struct btrfs_timespec stime;
+	struct btrfs_timespec rtime;
 } __attribute__ ((__packed__));
 
 /*
@@ -1416,6 +1445,8 @@ struct btrfs_root {
 	dev_t anon_dev;
 
 	int force_cow;
+
+	spinlock_t root_times_lock;
 };
 
 struct btrfs_ioctl_defrag_range_args {
@@ -2189,6 +2220,16 @@ BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
 BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
 BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
 			 last_snapshot, 64);
+BTRFS_SETGET_STACK_FUNCS(root_generation_v2, struct btrfs_root_item,
+			 generation_v2, 64);
+BTRFS_SETGET_STACK_FUNCS(root_ctransid, struct btrfs_root_item,
+			 ctransid, 64);
+BTRFS_SETGET_STACK_FUNCS(root_otransid, struct btrfs_root_item,
+			 otransid, 64);
+BTRFS_SETGET_STACK_FUNCS(root_stransid, struct btrfs_root_item,
+			 stransid, 64);
+BTRFS_SETGET_STACK_FUNCS(root_rtransid, struct btrfs_root_item,
+			 rtransid, 64);
 
 static inline bool btrfs_root_readonly(struct btrfs_root *root)
 {
@@ -2829,6 +2870,8 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
 void btrfs_set_root_node(struct btrfs_root_item *item,
 			 struct extent_buffer *node);
 void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
+void btrfs_update_root_times(struct btrfs_trans_handle *trans,
+			     struct btrfs_root *root);
 
 /* dir-item.c */
 int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7b845ff..d3b49ad 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1182,6 +1182,8 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->defrag_running = 0;
 	root->root_key.objectid = objectid;
 	root->anon_dev = 0;
+
+	spin_lock_init(&root->root_times_lock);
 }
 
 static int __must_check find_and_setup_root(struct btrfs_root *tree_root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 139be17..0f6a65d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2734,6 +2734,8 @@ noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
 	 */
 	if (!btrfs_is_free_space_inode(root, inode)
 	    && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		btrfs_update_root_times(trans, root);
+
 		ret = btrfs_delayed_update_inode(trans, root, inode);
 		if (!ret)
 			btrfs_set_inode_last_trans(trans, inode);
@@ -4728,6 +4730,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	trace_btrfs_inode_new(inode);
 	btrfs_set_inode_last_trans(trans, inode);
 
+	btrfs_update_root_times(trans, root);
+
 	return inode;
 fail:
 	if (dir)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7011871..8d258cb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -41,6 +41,7 @@
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/uuid.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -346,11 +347,13 @@ static noinline int create_subvol(struct btrfs_root *root,
 	struct btrfs_root *new_root;
 	struct dentry *parent = dentry->d_parent;
 	struct inode *dir;
+	struct timespec cur_time = CURRENT_TIME;
 	int ret;
 	int err;
 	u64 objectid;
 	u64 new_dirid = BTRFS_FIRST_FREE_OBJECTID;
 	u64 index = 0;
+	uuid_le new_uuid;
 
 	ret = btrfs_find_free_objectid(root->fs_info->tree_root, &objectid);
 	if (ret)
@@ -389,8 +392,9 @@ static noinline int create_subvol(struct btrfs_root *root,
 			    BTRFS_UUID_SIZE);
 	btrfs_mark_buffer_dirty(leaf);
 
+	memset(&root_item, 0, sizeof(root_item));
+
 	inode_item = &root_item.inode;
-	memset(inode_item, 0, sizeof(*inode_item));
 	inode_item->generation = cpu_to_le64(1);
 	inode_item->size = cpu_to_le64(3);
 	inode_item->nlink = cpu_to_le32(1);
@@ -408,8 +412,15 @@ static noinline int create_subvol(struct btrfs_root *root,
 	btrfs_set_root_used(&root_item, leaf->len);
 	btrfs_set_root_last_snapshot(&root_item, 0);
 
-	memset(&root_item.drop_progress, 0, sizeof(root_item.drop_progress));
-	root_item.drop_level = 0;
+	btrfs_set_root_generation_v2(&root_item,
+			btrfs_root_generation(&root_item));
+	uuid_le_gen(&new_uuid);
+	memcpy(root_item.uuid, new_uuid.b, BTRFS_UUID_SIZE);
+	root_item.otime.sec = cpu_to_le64(cur_time.tv_sec);
+	root_item.otime.nsec = cpu_to_le64(cur_time.tv_nsec);
+	root_item.ctime = root_item.otime;
+	btrfs_set_root_ctransid(&root_item, trans->transid);
+	btrfs_set_root_otransid(&root_item, trans->transid);
 
 	btrfs_tree_unlock(leaf);
 	free_extent_buffer(leaf);
@@ -3395,6 +3406,83 @@ out:
 	return ret;
 }
 
+static long btrfs_ioctl_set_received_subvol(struct file *file,
+					    void __user *arg)
+{
+	struct btrfs_ioctl_received_subvol_args *sa = NULL;
+	struct inode *inode = fdentry(file)->d_inode;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_root_item *root_item = &root->root_item;
+	struct btrfs_trans_handle *trans;
+	int ret = 0;
+
+	ret = mnt_want_write_file(file);
+	if (ret < 0)
+		return ret;
+
+	down_write(&root->fs_info->subvol_sem);
+
+	if (btrfs_ino(inode) != BTRFS_FIRST_FREE_OBJECTID) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (btrfs_root_readonly(root)) {
+		ret = -EROFS;
+		goto out;
+	}
+
+	if (!inode_owner_or_capable(inode)) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	sa = memdup_user(arg, sizeof(*sa));
+	if (IS_ERR(sa)) {
+		ret = PTR_ERR(sa);
+		sa = NULL;
+		goto out;
+	}
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out;
+	}
+
+	sa->rtransid = trans->transid;
+	sa->rtime = CURRENT_TIME;
+
+	memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE);
+	btrfs_set_root_stransid(root_item, sa->stransid);
+	btrfs_set_root_rtransid(root_item, sa->rtransid);
+	root_item->stime.sec = cpu_to_le64(sa->stime.tv_sec);
+	root_item->stime.nsec = cpu_to_le64(sa->stime.tv_nsec);
+	root_item->rtime.sec = cpu_to_le64(sa->rtime.tv_sec);
+	root_item->rtime.nsec = cpu_to_le64(sa->rtime.tv_nsec);
+
+	ret = btrfs_update_root(trans, root->fs_info->tree_root,
+				&root->root_key, &root->root_item);
+	if (ret < 0) {
+		goto out;
+	} else {
+		ret = btrfs_commit_transaction(trans, root);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = copy_to_user(arg, sa, sizeof(*sa));
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	kfree(sa);
+	up_write(&root->fs_info->subvol_sem);
+	mnt_drop_write_file(file);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -3477,6 +3565,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_balance_ctl(root, arg);
 	case BTRFS_IOC_BALANCE_PROGRESS:
 		return btrfs_ioctl_balance_progress(root, argp);
+	case BTRFS_IOC_SET_RECEIVED_SUBVOL:
+		return btrfs_ioctl_set_received_subvol(file, argp);
 	case BTRFS_IOC_GET_DEV_STATS:
 		return btrfs_ioctl_get_dev_stats(root, argp, 0);
 	case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index e440aa6..c9e3fac 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -295,6 +295,15 @@ struct btrfs_ioctl_get_dev_stats {
 	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
 };
 
+struct btrfs_ioctl_received_subvol_args {
+	char	uuid[BTRFS_UUID_SIZE];	/* in */
+	__u64	stransid;		/* in */
+	__u64	rtransid;		/* out */
+	struct timespec stime;		/* in */
+	struct timespec rtime;		/* out */
+	__u64	reserved[16];
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -359,6 +368,10 @@ struct btrfs_ioctl_get_dev_stats {
 					struct btrfs_ioctl_ino_path_args)
 #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
 					struct btrfs_ioctl_ino_path_args)
+
+#define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
+				struct btrfs_ioctl_received_subvol_args)
+
 #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_GET_AND_RESET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 24fb8ce..17d638e 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -16,6 +16,7 @@
  * Boston, MA 021110-1307, USA.
  */
 
+#include <linux/uuid.h>
 #include "ctree.h"
 #include "transaction.h"
 #include "disk-io.h"
@@ -25,6 +26,9 @@
  * lookup the root with the highest offset for a given objectid.  The key we do
  * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
  * on error.
+ * We also check if the root was once mounted with an older kernel. If we detect
+ * this, the new fields coming after 'level' get overwritten with zeros so to
+ * invalidate the fields.
  */
 int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
 			struct btrfs_root_item *item, struct btrfs_key *key)
@@ -35,6 +39,9 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
 	struct extent_buffer *l;
 	int ret;
 	int slot;
+	int len;
+	int need_reset = 0;
+	uuid_le uuid;
 
 	search_key.objectid = objectid;
 	search_key.type = BTRFS_ROOT_ITEM_KEY;
@@ -60,11 +67,36 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
 		ret = 1;
 		goto out;
 	}
-	if (item)
+	if (item) {
+		len = btrfs_item_size_nr(l, slot);
 		read_extent_buffer(l, item, btrfs_item_ptr_offset(l, slot),
-				   sizeof(*item));
+				min_t(int, len, (int)sizeof(*item)));
+		if (len < sizeof(*item))
+			need_reset = 1;
+		if (!need_reset && btrfs_root_generation(item)
+			!= btrfs_root_generation_v2(item)) {
+			if (btrfs_root_generation_v2(item) != 0) {
+				printk(KERN_WARNING "btrfs: mismatching "
+						"generation and generation_v2 "
+						"found in root item. This root "
+						"was probably mounted with an "
+						"older kernel. Resetting all "
+						"new fields.\n");
+			}
+			need_reset = 1;
+		}
+		if (need_reset) {
+			memset(&item->generation_v2, 0,
+				sizeof(*item) - offsetof(struct btrfs_root_item,
+						generation_v2));
+
+			uuid_le_gen(&uuid);
+			memcpy(item->uuid, uuid.b, BTRFS_UUID_SIZE);
+		}
+	}
 	if (key)
 		memcpy(key, &found_key, sizeof(found_key));
+
 	ret = 0;
 out:
 	btrfs_free_path(path);
@@ -91,16 +123,15 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
 	int ret;
 	int slot;
 	unsigned long ptr;
+	int old_len;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
 
 	ret = btrfs_search_slot(trans, root, key, path, 0, 1);
-	if (ret < 0) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto out;
-	}
+	if (ret < 0)
+		goto out_abort;
 
 	if (ret != 0) {
 		btrfs_print_leaf(root, path->nodes[0]);
@@ -113,11 +144,47 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
 	l = path->nodes[0];
 	slot = path->slots[0];
 	ptr = btrfs_item_ptr_offset(l, slot);
+	old_len = btrfs_item_size_nr(l, slot);
+
+	/*
+	 * If this is the first time we update the root item which originated
+	 * from an older kernel, we need to enlarge the item size to make room
+	 * for the added fields.
+	 */
+	if (old_len < sizeof(*item)) {
+		btrfs_release_path(path);
+		ret = btrfs_search_slot(trans, root, key, path,
+				-1, 1);
+		if (ret < 0)
+			goto out_abort;
+		ret = btrfs_del_item(trans, root, path);
+		if (ret < 0)
+			goto out_abort;
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, root, path,
+				key, sizeof(*item));
+		if (ret < 0)
+			goto out_abort;
+		l = path->nodes[0];
+		slot = path->slots[0];
+		ptr = btrfs_item_ptr_offset(l, slot);
+	}
+
+	/*
+	 * Update generation_v2 so at the next mount we know the new root
+	 * fields are valid.
+	 */
+	btrfs_set_root_generation_v2(item, btrfs_root_generation(item));
+
 	write_extent_buffer(l, item, ptr, sizeof(*item));
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 out:
 	btrfs_free_path(path);
 	return ret;
+
+out_abort:
+	btrfs_abort_transaction(trans, root, ret);
+	goto out;
 }
 
 int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
@@ -454,3 +521,16 @@ void btrfs_check_and_init_root_item(struct btrfs_root_item *root_item)
 		root_item->byte_limit = 0;
 	}
 }
+
+void btrfs_update_root_times(struct btrfs_trans_handle *trans,
+			     struct btrfs_root *root)
+{
+	struct btrfs_root_item *item = &root->root_item;
+	struct timespec ct = CURRENT_TIME;
+
+	spin_lock(&root->root_times_lock);
+	item->ctransid = cpu_to_le64(trans->transid);
+	item->ctime.sec = cpu_to_le64(ct.tv_sec);
+	item->ctime.nsec = cpu_to_le32(ct.tv_nsec);
+	spin_unlock(&root->root_times_lock);
+}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b72b068..a21f308 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -22,6 +22,7 @@
 #include <linux/writeback.h>
 #include <linux/pagemap.h>
 #include <linux/blkdev.h>
+#include <linux/uuid.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -926,11 +927,13 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	struct dentry *dentry;
 	struct extent_buffer *tmp;
 	struct extent_buffer *old;
+	struct timespec cur_time = CURRENT_TIME;
 	int ret;
 	u64 to_reserve = 0;
 	u64 index = 0;
 	u64 objectid;
 	u64 root_flags;
+	uuid_le new_uuid;
 
 	rsv = trans->block_rsv;
 
@@ -1016,6 +1019,20 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		root_flags &= ~BTRFS_ROOT_SUBVOL_RDONLY;
 	btrfs_set_root_flags(new_root_item, root_flags);
 
+	btrfs_set_root_generation_v2(new_root_item,
+			trans->transid);
+	uuid_le_gen(&new_uuid);
+	memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
+	memcpy(new_root_item->parent_uuid, root->root_item.uuid,
+			BTRFS_UUID_SIZE);
+	new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
+	new_root_item->otime.nsec = cpu_to_le32(cur_time.tv_nsec);
+	btrfs_set_root_otransid(new_root_item, trans->transid);
+	memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
+	memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
+	btrfs_set_root_stransid(new_root_item, 0);
+	btrfs_set_root_rtransid(new_root_item, 0);
+
 	old = btrfs_lock_root_node(root);
 	ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
 	if (ret) {
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 13:38 [RFC PATCH 0/7] Experimental btrfs send/receive (kernel side) Alexander Block
                   ` (3 preceding siblings ...)
  2012-07-04 13:38 ` [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times Alexander Block
@ 2012-07-04 13:38 ` Alexander Block
  2012-07-04 18:27   ` Alex Lyakas
  2012-07-04 19:13   ` Alex Lyakas
  2012-07-04 13:38 ` [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1) Alexander Block
  2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
  6 siblings, 2 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

This function is used to find the differences between
two trees. The tree compare skips whole subtrees if it
detects shared tree blocks and thus is pretty fast.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/ctree.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ctree.h |   15 ++
 2 files changed, 440 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 33c8a03..d1c7efd 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5007,6 +5007,431 @@ out:
 	return ret;
 }
 
+static void tree_move_down(struct btrfs_root *root,
+			   struct btrfs_path *path,
+			   int *level, int root_level)
+{
+	path->nodes[*level - 1] = read_node_slot(root, path->nodes[*level],
+					path->slots[*level]);
+	path->slots[*level - 1] = 0;
+	(*level)--;
+}
+
+static int tree_move_next_or_upnext(struct btrfs_root *root,
+				    struct btrfs_path *path,
+				    int *level, int root_level)
+{
+	int ret = 0;
+	int nritems;
+	nritems = btrfs_header_nritems(path->nodes[*level]);
+
+	path->slots[*level]++;
+
+	while (path->slots[*level] == nritems) {
+		if (*level == root_level)
+			return -1;
+
+		/* move upnext */
+		path->slots[*level] = 0;
+		free_extent_buffer(path->nodes[*level]);
+		path->nodes[*level] = NULL;
+		(*level)++;
+		path->slots[*level]++;
+
+		nritems = btrfs_header_nritems(path->nodes[*level]);
+		ret = 1;
+	}
+	return ret;
+}
+
+/*
+ * Returns 1 if it had to move up and next. 0 is returned if it moved only next
+ * or down. -1 is returned when the end of the tree was reached.
+ */
+static int tree_advance(struct btrfs_root *root,
+			struct btrfs_path *path,
+			int *level, int root_level,
+			int allow_down,
+			struct btrfs_key *key)
+{
+	int ret;
+
+	if (*level == 0 || !allow_down) {
+		ret = tree_move_next_or_upnext(root, path, level, root_level);
+	} else {
+		tree_move_down(root, path, level, root_level);
+		ret = 0;
+	}
+	if (ret >= 0) {
+		if (*level == 0)
+			btrfs_item_key_to_cpu(path->nodes[*level], key,
+					path->slots[*level]);
+		else
+			btrfs_node_key_to_cpu(path->nodes[*level], key,
+					path->slots[*level]);
+	}
+	return ret;
+}
+
+static int tree_compare_item(struct btrfs_root *left_root,
+			     struct btrfs_path *left_path,
+			     struct btrfs_path *right_path,
+			     char *tmp_buf)
+{
+	int cmp;
+	int len1, len2;
+	unsigned long off1, off2;
+
+	len1 = btrfs_item_size_nr(left_path->nodes[0], left_path->slots[0]);
+	len2 = btrfs_item_size_nr(right_path->nodes[0], right_path->slots[0]);
+	if (len1 != len2)
+		return 1;
+
+	off1 = btrfs_item_ptr_offset(left_path->nodes[0], left_path->slots[0]);
+	off2 = btrfs_item_ptr_offset(right_path->nodes[0],
+				right_path->slots[0]);
+
+	read_extent_buffer(left_path->nodes[0], tmp_buf, off1, len1);
+
+	cmp = memcmp_extent_buffer(right_path->nodes[0], tmp_buf, off2, len1);
+	if (cmp)
+		return 1;
+	return 0;
+}
+
+#define ADVANCE 1
+#define ADVANCE_ONLY_NEXT -1
+
+/*
+ * This function compares two trees and calls the provided callback for
+ * every changed/new/deleted item it finds.
+ * If shared tree blocks are encountered, whole subtrees are skipped, making
+ * the compare pretty fast on snapshotted subvolumes.
+ *
+ * This currently works on commit roots only. As commit roots are read only,
+ * we don't do any locking. The commit roots are protected with transactions.
+ * Transactions are ended and rejoined when a commit is tried in between.
+ *
+ * This function checks for modifications done to the trees while comparing.
+ * If it detects a change, it aborts immediately.
+ */
+int btrfs_compare_trees(struct btrfs_root *left_root,
+			struct btrfs_root *right_root,
+			btrfs_changed_cb_t changed_cb, void *ctx)
+{
+	int ret;
+	int cmp;
+	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_path *left_path = NULL;
+	struct btrfs_path *right_path = NULL;
+	struct btrfs_key left_key;
+	struct btrfs_key right_key;
+	char *tmp_buf = NULL;
+	int left_root_level;
+	int right_root_level;
+	int left_level;
+	int right_level;
+	int left_end_reached;
+	int right_end_reached;
+	int advance_left;
+	int advance_right;
+	u64 left_blockptr;
+	u64 right_blockptr;
+	u64 left_start_ctransid;
+	u64 right_start_ctransid;
+	u64 ctransid;
+
+	left_path = btrfs_alloc_path();
+	if (!left_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	right_path = btrfs_alloc_path();
+	if (!right_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	tmp_buf = kmalloc(left_root->leafsize, GFP_NOFS);
+	if (!tmp_buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	left_path->search_commit_root = 1;
+	left_path->skip_locking = 1;
+	right_path->search_commit_root = 1;
+	right_path->skip_locking = 1;
+
+	spin_lock(&left_root->root_times_lock);
+	left_start_ctransid = btrfs_root_ctransid(&left_root->root_item);
+	spin_unlock(&left_root->root_times_lock);
+
+	spin_lock(&right_root->root_times_lock);
+	right_start_ctransid = btrfs_root_ctransid(&right_root->root_item);
+	spin_unlock(&right_root->root_times_lock);
+
+	trans = btrfs_join_transaction(left_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out;
+	}
+
+	/*
+	 * Strategy: Go to the first items of both trees. Then do
+	 *
+	 * If both trees are at level 0
+	 *   Compare keys of current items
+	 *     If left < right treat left item as new, advance left tree
+	 *       and repeat
+	 *     If left > right treat right item as deleted, advance right tree
+	 *       and repeat
+	 *     If left == right do deep compare of items, treat as changed if
+	 *       needed, advance both trees and repeat
+	 * If both trees are at the same level but not at level 0
+	 *   Compare keys of current nodes/leaves
+	 *     If left < right advance left tree and repeat
+	 *     If left > right advance right tree and repeat
+	 *     If left == right compare blockptrs of the next nodes/leaves
+	 *       If they match advance both trees but stay at the same level
+	 *         and repeat
+	 *       If they don't match advance both trees while allowing to go
+	 *         deeper and repeat
+	 * If tree levels are different
+	 *   Advance the tree that needs it and repeat
+	 *
+	 * Advancing a tree means:
+	 *   If we are at level 0, try to go to the next slot. If that's not
+	 *   possible, go one level up and repeat. Stop when we found a level
+	 *   where we could go to the next slot. We may at this point be on a
+	 *   node or a leaf.
+	 *
+	 *   If we are not at level 0 and not on shared tree blocks, go one
+	 *   level deeper.
+	 *
+	 *   If we are not at level 0 and on shared tree blocks, go one slot to
+	 *   the right if possible or go up and right.
+	 */
+
+	left_level = btrfs_header_level(left_root->commit_root);
+	left_root_level = left_level;
+	left_path->nodes[left_level] = left_root->commit_root;
+	extent_buffer_get(left_path->nodes[left_level]);
+
+	right_level = btrfs_header_level(right_root->commit_root);
+	right_root_level = right_level;
+	right_path->nodes[right_level] = right_root->commit_root;
+	extent_buffer_get(right_path->nodes[right_level]);
+
+	if (left_level == 0)
+		btrfs_item_key_to_cpu(left_path->nodes[left_level],
+				&left_key, left_path->slots[left_level]);
+	else
+		btrfs_node_key_to_cpu(left_path->nodes[left_level],
+				&left_key, left_path->slots[left_level]);
+	if (right_level == 0)
+		btrfs_item_key_to_cpu(right_path->nodes[right_level],
+				&right_key, right_path->slots[right_level]);
+	else
+		btrfs_node_key_to_cpu(right_path->nodes[right_level],
+				&right_key, right_path->slots[right_level]);
+
+	left_end_reached = right_end_reached = 0;
+	advance_left = advance_right = 0;
+
+	while (1) {
+		/*
+		 * We need to make sure the transaction does not get committed
+		 * while we do anything on commit roots. This means, we need to
+		 * join and leave transactions for every item that we process.
+		 */
+		if (trans && btrfs_should_end_transaction(trans, left_root)) {
+			btrfs_release_path(left_path);
+			btrfs_release_path(right_path);
+
+			ret = btrfs_end_transaction(trans, left_root);
+			trans = NULL;
+			if (ret < 0)
+				goto out;
+		}
+		/* now rejoin the transaction */
+		if (!trans) {
+			trans = btrfs_join_transaction(left_root);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				trans = NULL;
+				goto out;
+			}
+
+			spin_lock(&left_root->root_times_lock);
+			ctransid = btrfs_root_ctransid(&left_root->root_item);
+			spin_unlock(&left_root->root_times_lock);
+			if (ctransid != left_start_ctransid)
+				left_start_ctransid = 0;
+
+			spin_lock(&right_root->root_times_lock);
+			ctransid = btrfs_root_ctransid(&right_root->root_item);
+			spin_unlock(&right_root->root_times_lock);
+			if (ctransid != right_start_ctransid)
+				right_start_ctransid = 0;
+
+			if (!left_start_ctransid || !right_start_ctransid) {
+				WARN(1, KERN_WARNING
+					"btrfs: btrfs_compare_tree detected "
+					"a change in one of the trees while "
+					"iterating. This is probably a "
+					"bug.\n");
+				ret = -EIO;
+				goto out;
+			}
+
+			/*
+			 * the commit root may have changed, so start again
+			 * where we stopped
+			 */
+			left_path->lowest_level = left_level;
+			right_path->lowest_level = right_level;
+			ret = btrfs_search_slot(NULL, left_root,
+					&left_key, left_path, 0, 0);
+			if (ret < 0)
+				goto out;
+			ret = btrfs_search_slot(NULL, right_root,
+					&right_key, right_path, 0, 0);
+			if (ret < 0)
+				goto out;
+		}
+
+		if (advance_left && !left_end_reached) {
+			ret = tree_advance(left_root, left_path, &left_level,
+					left_root_level,
+					advance_left != ADVANCE_ONLY_NEXT,
+					&left_key);
+			if (ret < 0)
+				left_end_reached = ADVANCE;
+			advance_left = 0;
+		}
+		if (advance_right && !right_end_reached) {
+			ret = tree_advance(right_root, right_path, &right_level,
+					right_root_level,
+					advance_right != ADVANCE_ONLY_NEXT,
+					&right_key);
+			if (ret < 0)
+				right_end_reached = ADVANCE;
+			advance_right = 0;
+		}
+
+		if (left_end_reached && right_end_reached) {
+			ret = 0;
+			goto out;
+		} else if (left_end_reached) {
+			if (right_level == 0) {
+				ret = changed_cb(left_root, right_root,
+						left_path, right_path,
+						&right_key,
+						BTRFS_COMPARE_TREE_DELETED,
+						ctx);
+				if (ret < 0)
+					goto out;
+			}
+			advance_right = ADVANCE;
+			continue;
+		} else if (right_end_reached) {
+			if (left_level == 0) {
+				ret = changed_cb(left_root, right_root,
+						left_path, right_path,
+						&left_key,
+						BTRFS_COMPARE_TREE_NEW,
+						ctx);
+				if (ret < 0)
+					goto out;
+			}
+			advance_left = ADVANCE;
+			continue;
+		}
+
+		if (left_level == 0 && right_level == 0) {
+			cmp = btrfs_comp_cpu_keys(&left_key, &right_key);
+			if (cmp < 0) {
+				ret = changed_cb(left_root, right_root,
+						left_path, right_path,
+						&left_key,
+						BTRFS_COMPARE_TREE_NEW,
+						ctx);
+				if (ret < 0)
+					goto out;
+				advance_left = ADVANCE;
+			} else if (cmp > 0) {
+				ret = changed_cb(left_root, right_root,
+						left_path, right_path,
+						&right_key,
+						BTRFS_COMPARE_TREE_DELETED,
+						ctx);
+				if (ret < 0)
+					goto out;
+				advance_right = ADVANCE;
+			} else {
+				ret = tree_compare_item(left_root, left_path,
+						right_path, tmp_buf);
+				if (ret) {
+					ret = changed_cb(left_root, right_root,
+						left_path, right_path,
+						&left_key,
+						BTRFS_COMPARE_TREE_CHANGED,
+						ctx);
+					if (ret < 0)
+						goto out;
+				}
+				advance_left = ADVANCE;
+				advance_right = ADVANCE;
+			}
+		} else if (left_level == right_level) {
+			cmp = btrfs_comp_cpu_keys(&left_key, &right_key);
+			if (cmp < 0) {
+				advance_left = ADVANCE;
+			} else if (cmp > 0) {
+				advance_right = ADVANCE;
+			} else {
+				left_blockptr = btrfs_node_blockptr(
+						left_path->nodes[left_level],
+						left_path->slots[left_level]);
+				right_blockptr = btrfs_node_blockptr(
+						right_path->nodes[right_level],
+						right_path->slots[right_level]);
+				if (left_blockptr == right_blockptr) {
+					/*
+					 * As we're on a shared block, don't
+					 * allow to go deeper.
+					 */
+					advance_left = ADVANCE_ONLY_NEXT;
+					advance_right = ADVANCE_ONLY_NEXT;
+				} else {
+					advance_left = ADVANCE;
+					advance_right = ADVANCE;
+				}
+			}
+		} else if (left_level < right_level) {
+			advance_right = ADVANCE;
+		} else {
+			advance_left = ADVANCE;
+		}
+	}
+
+out:
+	btrfs_free_path(left_path);
+	btrfs_free_path(right_path);
+	kfree(tmp_buf);
+
+	if (trans) {
+		if (!ret)
+			ret = btrfs_end_transaction(trans, left_root);
+		else
+			btrfs_end_transaction(trans, left_root);
+	}
+
+	return ret;
+}
+
 /*
  * this is similar to btrfs_next_leaf, but does not try to preserve
  * and fixup the path.  It looks for and returns the next key in the
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2bd5df8..74f273a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2721,6 +2721,21 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 			 struct btrfs_key *max_key,
 			 struct btrfs_path *path, int cache_only,
 			 u64 min_trans);
+enum btrfs_compare_tree_result {
+	BTRFS_COMPARE_TREE_NEW,
+	BTRFS_COMPARE_TREE_DELETED,
+	BTRFS_COMPARE_TREE_CHANGED,
+};
+typedef int (*btrfs_changed_cb_t)(struct btrfs_root *left_root,
+				  struct btrfs_root *right_root,
+				  struct btrfs_path *left_path,
+				  struct btrfs_path *right_path,
+				  struct btrfs_key *key,
+				  enum btrfs_compare_tree_result result,
+				  void *ctx);
+int btrfs_compare_trees(struct btrfs_root *left_root,
+			struct btrfs_root *right_root,
+			btrfs_changed_cb_t cb, void *ctx);
 int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, struct extent_buffer *buf,
 		    struct extent_buffer *parent, int parent_slot,
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1)
  2012-07-04 13:38 [RFC PATCH 0/7] Experimental btrfs send/receive (kernel side) Alexander Block
                   ` (4 preceding siblings ...)
  2012-07-04 13:38 ` [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function Alexander Block
@ 2012-07-04 13:38 ` Alexander Block
  2012-07-18  6:59   ` Arne Jansen
  2012-07-21 10:53   ` Arne Jansen
  2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
  6 siblings, 2 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

This patch introduces the BTRFS_IOC_SEND ioctl that is
required for send. It allows btrfs-progs to implement
full and incremental sends. Patches for btrfs-progs will
follow.

I had to split the patch as it got larger than 100k, which is
the limit for the mailing list. The first part only contains
the send.h header, the helper functions for TLV handling and
long path name handling, and some other helpers. The second
part contains the actual send logic from send.c.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/Makefile |    2 +-
 fs/btrfs/ioctl.h  |   10 +
 fs/btrfs/send.c   | 1009 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/send.h   |  126 +++++++
 4 files changed, 1146 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/send.c
 create mode 100644 fs/btrfs/send.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 0c4fa2b..f740644 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-	   reada.o backref.o ulist.o
+	   reada.o backref.o ulist.o send.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index c9e3fac..282bc64 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -304,6 +304,15 @@ struct btrfs_ioctl_received_subvol_args {
 	__u64	reserved[16];
 };
 
+struct btrfs_ioctl_send_args {
+	__s64 send_fd;			/* in */
+	__u64 clone_sources_count;	/* in */
+	__u64 __user *clone_sources;	/* in */
+	__u64 parent_root;		/* in */
+	__u64 flags;			/* in */
+	__u64 reserved[4];		/* in */
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -371,6 +380,7 @@ struct btrfs_ioctl_received_subvol_args {
 
 #define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
 				struct btrfs_ioctl_received_subvol_args)
+#define BTRFS_IOC_SEND _IOW(BTRFS_IOCTL_MAGIC, 38, struct btrfs_ioctl_send_args)
 
 #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
 				      struct btrfs_ioctl_get_dev_stats)
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
new file mode 100644
index 0000000..47a2557
--- /dev/null
+++ b/fs/btrfs/send.c
@@ -0,0 +1,1009 @@
+/*
+ * Copyright (C) 2012 Alexander Block.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/bsearch.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/sort.h>
+#include <linux/mount.h>
+#include <linux/xattr.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/radix-tree.h>
+#include <linux/crc32c.h>
+
+#include "send.h"
+#include "backref.h"
+#include "locking.h"
+#include "disk-io.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+
+static int g_verbose = 0;
+
+#define verbose_printk(...) do { if (g_verbose) printk(__VA_ARGS__); } while (0)
+
+/*
+ * A fs_path is a helper to dynamically build path names with unknown size.
+ * It reallocates the internal buffer on demand.
+ * It allows fast adding of path elements on the right side (normal path) and
+ * fast adding to the left side (reversed path). A reversed path can also be
+ * unreversed if needed.
+ */
+struct fs_path {
+	union {
+		struct {
+			char *start;
+			char *end;
+			char *prepared;
+
+			char *buf;
+			int buf_len;
+			int reversed:1;
+			int virtual_mem:1;
+			char inline_buf[];
+		};
+		char pad[PAGE_SIZE];
+	};
+};
+#define FS_PATH_INLINE_SIZE \
+	(sizeof(struct fs_path) - offsetof(struct fs_path, inline_buf))
+
+
+/* reused for each extent */
+struct clone_root {
+	struct btrfs_root *root;
+	u64 ino;
+	u64 offset;
+
+	u64 found_refs;
+};
+
+#define SEND_CTX_MAX_NAME_CACHE_SIZE 128
+#define SEND_CTX_NAME_CACHE_CLEAN_SIZE (SEND_CTX_MAX_NAME_CACHE_SIZE * 2)
+
+struct send_ctx {
+	struct file *send_filp;
+	loff_t send_off;
+	char *send_buf;
+	u32 send_size;
+	u32 send_max_size;
+	u64 total_send_size;
+	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1];
+
+	struct vfsmount *mnt;
+
+	struct btrfs_root *send_root;
+	struct btrfs_root *parent_root;
+	struct clone_root *clone_roots;
+	int clone_roots_cnt;
+
+	/* current state of the compare_tree call */
+	struct btrfs_path *left_path;
+	struct btrfs_path *right_path;
+	struct btrfs_key *cmp_key;
+
+	/*
+	 * Info about the currently processed inode. In case of deleted inodes,
+	 * these are the values from the deleted inode.
+	 */
+	u64 cur_ino;
+	u64 cur_inode_gen;
+	int cur_inode_new;
+	int cur_inode_new_gen;
+	int cur_inode_deleted;
+	u64 cur_inode_size;
+	u64 cur_inode_mode;
+
+	u64 send_progress;
+
+	struct list_head new_refs;
+	struct list_head deleted_refs;
+
+	struct radix_tree_root name_cache;
+	struct list_head name_cache_list;
+	int name_cache_size;
+
+	struct file *cur_inode_filp;
+	char *read_buf;
+};
+
+struct name_cache_entry {
+	struct list_head list;
+	struct list_head use_list;
+	u64 ino;
+	u64 gen;
+	u64 parent_ino;
+	u64 parent_gen;
+	int ret;
+	int need_later_update;
+	int name_len;
+	char name[];
+};
+
+static void fs_path_reset(struct fs_path *p)
+{
+	if (p->reversed) {
+		p->start = p->buf + p->buf_len - 1;
+		p->end = p->start;
+		*p->start = 0;
+	} else {
+		p->start = p->buf;
+		p->end = p->start;
+		*p->start = 0;
+	}
+}
+
+static struct fs_path *fs_path_alloc(struct send_ctx *sctx)
+{
+	struct fs_path *p;
+
+	p = kmalloc(sizeof(*p), GFP_NOFS);
+	if (!p)
+		return NULL;
+	p->reversed = 0;
+	p->virtual_mem = 0;
+	p->buf = p->inline_buf;
+	p->buf_len = FS_PATH_INLINE_SIZE;
+	fs_path_reset(p);
+	return p;
+}
+
+static struct fs_path *fs_path_alloc_reversed(struct send_ctx *sctx)
+{
+	struct fs_path *p;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return NULL;
+	p->reversed = 1;
+	fs_path_reset(p);
+	return p;
+}
+
+static void fs_path_free(struct send_ctx *sctx, struct fs_path *p)
+{
+	if (!p)
+		return;
+	if (p->buf != p->inline_buf) {
+		if (p->virtual_mem)
+			vfree(p->buf);
+		else
+			kfree(p->buf);
+	}
+	kfree(p);
+}
+
+static int fs_path_len(struct fs_path *p)
+{
+	return p->end - p->start;
+}
+
+static int fs_path_ensure_buf(struct fs_path *p, int len)
+{
+	char *tmp_buf;
+	int path_len;
+	int old_buf_len;
+
+	len++;
+
+	if (p->buf_len >= len)
+		return 0;
+
+	path_len = p->end - p->start;
+	old_buf_len = p->buf_len;
+	len = PAGE_ALIGN(len);
+
+	if (p->buf == p->inline_buf) {
+		tmp_buf = kmalloc(len, GFP_NOFS);
+		if (!tmp_buf) {
+			tmp_buf = vmalloc(len);
+			if (!tmp_buf)
+				return -ENOMEM;
+			p->virtual_mem = 1;
+		}
+		memcpy(tmp_buf, p->buf, p->buf_len);
+		p->buf = tmp_buf;
+		p->buf_len = len;
+	} else {
+		if (p->virtual_mem) {
+			tmp_buf = vmalloc(len);
+			if (!tmp_buf)
+				return -ENOMEM;
+			memcpy(tmp_buf, p->buf, p->buf_len);
+			vfree(p->buf);
+		} else {
+			tmp_buf = krealloc(p->buf, len, GFP_NOFS);
+			if (!tmp_buf) {
+				tmp_buf = vmalloc(len);
+				if (!tmp_buf)
+					return -ENOMEM;
+				memcpy(tmp_buf, p->buf, p->buf_len);
+				kfree(p->buf);
+				p->virtual_mem = 1;
+			}
+		}
+		p->buf = tmp_buf;
+		p->buf_len = len;
+	}
+	if (p->reversed) {
+		tmp_buf = p->buf + old_buf_len - path_len - 1;
+		p->end = p->buf + p->buf_len - 1;
+		p->start = p->end - path_len;
+		memmove(p->start, tmp_buf, path_len + 1);
+	} else {
+		p->start = p->buf;
+		p->end = p->start + path_len;
+	}
+	return 0;
+}
+
+static int fs_path_prepare_for_add(struct fs_path *p, int name_len)
+{
+	int ret;
+	int new_len;
+
+	new_len = p->end - p->start + name_len;
+	if (p->start != p->end)
+		new_len++;
+	ret = fs_path_ensure_buf(p, new_len);
+	if (ret < 0)
+		goto out;
+
+	if (p->reversed) {
+		if (p->start != p->end)
+			*--p->start = '/';
+		p->start -= name_len;
+		p->prepared = p->start;
+	} else {
+		if (p->start != p->end)
+			*p->end++ = '/';
+		p->prepared = p->end;
+		p->end += name_len;
+		*p->end = 0;
+	}
+
+out:
+	return ret;
+}
+
+static int fs_path_add(struct fs_path *p, char *name, int name_len)
+{
+	int ret;
+
+	ret = fs_path_prepare_for_add(p, name_len);
+	if (ret < 0)
+		goto out;
+	memcpy(p->prepared, name, name_len);
+	p->prepared = NULL;
+
+out:
+	return ret;
+}
+
+static int fs_path_add_path(struct fs_path *p, struct fs_path *p2)
+{
+	int ret;
+
+	ret = fs_path_prepare_for_add(p, p2->end - p2->start);
+	if (ret < 0)
+		goto out;
+	memcpy(p->prepared, p2->start, p2->end - p2->start);
+	p->prepared = NULL;
+
+out:
+	return ret;
+}
+
+static int fs_path_add_from_extent_buffer(struct fs_path *p,
+					  struct extent_buffer *eb,
+					  unsigned long off, int len)
+{
+	int ret;
+
+	ret = fs_path_prepare_for_add(p, len);
+	if (ret < 0)
+		goto out;
+
+	read_extent_buffer(eb, p->prepared, off, len);
+	p->prepared = NULL;
+
+out:
+	return ret;
+}
+
+static int fs_path_copy(struct fs_path *p, struct fs_path *from)
+{
+	int ret;
+
+	p->reversed = from->reversed;
+	fs_path_reset(p);
+
+	ret = fs_path_add_path(p, from);
+
+	return ret;
+}
+
+
+static void fs_path_unreverse(struct fs_path *p)
+{
+	char *tmp;
+	int len;
+
+	if (!p->reversed)
+		return;
+
+	tmp = p->start;
+	len = p->end - p->start;
+	p->start = p->buf;
+	p->end = p->start + len;
+	memmove(p->start, tmp, len + 1);
+	p->reversed = 0;
+}
+
+static struct btrfs_path *alloc_path_for_send(void)
+{
+	struct btrfs_path *path;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return NULL;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+	return path;
+}
+
+static int write_buf(struct send_ctx *sctx, const void *buf, u32 len)
+{
+	int ret;
+	mm_segment_t old_fs;
+	u32 pos = 0;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	while (pos < len) {
+		ret = vfs_write(sctx->send_filp, (char *)buf + pos, len - pos,
+				&sctx->send_off);
+		/* TODO handle that correctly */
+		/*if (ret == -ERESTARTSYS) {
+			continue;
+		}*/
+		if (ret < 0) {
+			printk(KERN_ERR "btrfs: send: write_buf failed %d\n", ret);
+			goto out;
+		}
+		if (ret == 0) {
+			ret = -EIO;
+			goto out;
+		}
+		pos += ret;
+	}
+
+	ret = 0;
+
+out:
+	set_fs(old_fs);
+	return ret;
+}
+
+static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
+{
+	struct btrfs_tlv_header *hdr;
+	int total_len = sizeof(*hdr) + len;
+	int left = sctx->send_max_size - sctx->send_size;
+
+	if (unlikely(left < total_len))
+		return -EOVERFLOW;
+
+	hdr = (struct btrfs_tlv_header *) (sctx->send_buf + sctx->send_size);
+	hdr->tlv_type = cpu_to_le16(attr);
+	hdr->tlv_len = cpu_to_le16(len);
+	memcpy(hdr + 1, data, len);
+	sctx->send_size += total_len;
+
+	return 0;
+}
+
+#if 0
+static int tlv_put_u8(struct send_ctx *sctx, u16 attr, u8 value)
+{
+	return tlv_put(sctx, attr, &value, sizeof(value));
+}
+
+static int tlv_put_u16(struct send_ctx *sctx, u16 attr, u16 value)
+{
+	__le16 tmp = cpu_to_le16(value);
+	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
+}
+
+static int tlv_put_u32(struct send_ctx *sctx, u16 attr, u32 value)
+{
+	__le32 tmp = cpu_to_le32(value);
+	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
+}
+#endif
+
+static int tlv_put_u64(struct send_ctx *sctx, u16 attr, u64 value)
+{
+	__le64 tmp = cpu_to_le64(value);
+	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
+}
+
+static int tlv_put_string(struct send_ctx *sctx, u16 attr,
+			  const char *str, int len)
+{
+	if (len == -1)
+		len = strlen(str);
+	return tlv_put(sctx, attr, str, len);
+}
+
+static int tlv_put_uuid(struct send_ctx *sctx, u16 attr,
+			const u8 *uuid)
+{
+	return tlv_put(sctx, attr, uuid, BTRFS_UUID_SIZE);
+}
+
+#if 0
+static int tlv_put_timespec(struct send_ctx *sctx, u16 attr,
+			    struct timespec *ts)
+{
+	struct btrfs_timespec bts;
+	bts.sec = cpu_to_le64(ts->tv_sec);
+	bts.nsec = cpu_to_le32(ts->tv_nsec);
+	return tlv_put(sctx, attr, &bts, sizeof(bts));
+}
+#endif
+
+static int tlv_put_btrfs_timespec(struct send_ctx *sctx, u16 attr,
+				  struct extent_buffer *eb,
+				  struct btrfs_timespec *ts)
+{
+	struct btrfs_timespec bts;
+	read_extent_buffer(eb, &bts, (unsigned long)ts, sizeof(bts));
+	return tlv_put(sctx, attr, &bts, sizeof(bts));
+}
+
+
+#define TLV_PUT(sctx, attrtype, attrlen, data) \
+	do { \
+		ret = tlv_put(sctx, attrtype, attrlen, data); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+
+#define TLV_PUT_INT(sctx, attrtype, bits, value) \
+	do { \
+		ret = tlv_put_u##bits(sctx, attrtype, value); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+
+#define TLV_PUT_U8(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 8, data)
+#define TLV_PUT_U16(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 16, data)
+#define TLV_PUT_U32(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 32, data)
+#define TLV_PUT_U64(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 64, data)
+#define TLV_PUT_STRING(sctx, attrtype, str, len) \
+	do { \
+		ret = tlv_put_string(sctx, attrtype, str, len); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+#define TLV_PUT_PATH(sctx, attrtype, p) \
+	do { \
+		ret = tlv_put_string(sctx, attrtype, p->start, \
+			p->end - p->start); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+#define TLV_PUT_UUID(sctx, attrtype, uuid) \
+	do { \
+		ret = tlv_put_uuid(sctx, attrtype, uuid); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+#define TLV_PUT_TIMESPEC(sctx, attrtype, ts) \
+	do { \
+		ret = tlv_put_timespec(sctx, attrtype, ts); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+#define TLV_PUT_BTRFS_TIMESPEC(sctx, attrtype, eb, ts) \
+	do { \
+		ret = tlv_put_btrfs_timespec(sctx, attrtype, eb, ts); \
+		if (ret < 0) \
+			goto tlv_put_failure; \
+	} while (0)
+
+static int send_header(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_stream_header hdr;
+
+	strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC);
+	hdr.version = cpu_to_le32(BTRFS_SEND_STREAM_VERSION);
+
+	ret = write_buf(sctx, &hdr, sizeof(hdr));
+
+	return ret;
+}
+
+/*
+ * For each command/item we want to send to userspace, we call this function.
+ */
+static int begin_cmd(struct send_ctx *sctx, int cmd)
+{
+	int ret = 0;
+	struct btrfs_cmd_header *hdr;
+
+	if (!sctx->send_buf) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	BUG_ON(sctx->send_size);
+
+	sctx->send_size += sizeof(*hdr);
+	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
+	hdr->cmd = cpu_to_le16(cmd);
+
+	return ret;
+}
+
+static int send_cmd(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_cmd_header *hdr;
+	u32 crc;
+
+	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
+	hdr->len = cpu_to_le32(sctx->send_size - sizeof(*hdr));
+	hdr->crc = 0;
+
+	crc = crc32c(0, (unsigned char *)sctx->send_buf, sctx->send_size);
+	hdr->crc = cpu_to_le32(crc);
+
+	ret = write_buf(sctx, sctx->send_buf, sctx->send_size);
+
+	sctx->total_send_size += sctx->send_size;
+	sctx->cmd_send_size[le16_to_cpu(hdr->cmd)] += sctx->send_size;
+	sctx->send_size = 0;
+
+	return ret;
+}
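The framing done by begin_cmd()/send_cmd() above is worth spelling out: the header's len field counts only the payload (the header itself is excluded), and the checksum covers the whole command with the crc field temporarily zeroed. The userspace sketch below is illustrative only: toy_crc() is a made-up stand-in for the kernel's crc32c(), and the fields stay host-endian instead of the on-wire little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for crc32c(); the real stream uses CRC-32C. */
static uint32_t toy_crc(const unsigned char *p, size_t n)
{
	uint32_t c = 0;

	while (n--)
		c = (c << 5) + c + *p++;
	return c;
}

/* Host-endian mirror of struct btrfs_cmd_header. */
struct cmd_header {
	uint32_t len;	/* payload length, header excluded */
	uint16_t cmd;
	uint32_t crc;	/* computed with this field set to 0 */
} __attribute__((__packed__));

/* Frame a command the way send_cmd() does: fill in len, zero the crc
 * field, checksum header+payload, then store the result. */
static void frame_cmd(unsigned char *buf, size_t total, uint16_t cmd)
{
	struct cmd_header *hdr = (struct cmd_header *)buf;

	hdr->len = (uint32_t)(total - sizeof(*hdr));
	hdr->cmd = cmd;
	hdr->crc = 0;
	hdr->crc = toy_crc(buf, total);
}
```

A receiver verifies a command the same way: save the crc field, zero it, recompute over the full command, and compare.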
+
+/*
+ * Sends a move instruction to user space
+ */
+static int send_rename(struct send_ctx *sctx,
+		     struct fs_path *from, struct fs_path *to)
+{
+	int ret;
+
+verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, from);
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_TO, to);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+/*
+ * Sends a link instruction to user space
+ */
+static int send_link(struct send_ctx *sctx,
+		     struct fs_path *path, struct fs_path *lnk)
+{
+	int ret;
+
+verbose_printk("btrfs: send_link %s -> %s\n", path->start, lnk->start);
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, lnk);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+/*
+ * Sends an unlink instruction to user space
+ */
+static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
+{
+	int ret;
+
+verbose_printk("btrfs: send_unlink %s\n", path->start);
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+/*
+ * Sends a rmdir instruction to user space
+ */
+static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
+{
+	int ret;
+
+verbose_printk("btrfs: send_rmdir %s\n", path->start);
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+/*
+ * Helper function to retrieve some fields from an inode item.
+ */
+static int get_inode_info(struct btrfs_root *root,
+			  u64 ino, u64 *size, u64 *gen,
+			  u64 *mode, u64 *uid, u64 *gid)
+{
+	int ret;
+	struct btrfs_inode_item *ii;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = ino;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ii = btrfs_item_ptr(path->nodes[0], path->slots[0],
+			struct btrfs_inode_item);
+	if (size)
+		*size = btrfs_inode_size(path->nodes[0], ii);
+	if (gen)
+		*gen = btrfs_inode_generation(path->nodes[0], ii);
+	if (mode)
+		*mode = btrfs_inode_mode(path->nodes[0], ii);
+	if (uid)
+		*uid = btrfs_inode_uid(path->nodes[0], ii);
+	if (gid)
+		*gid = btrfs_inode_gid(path->nodes[0], ii);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+typedef int (*iterate_inode_ref_t)(int num, u64 dir, int index,
+				   struct fs_path *p,
+				   void *ctx);
+
+/*
+ * Helper function to iterate the entries in ONE btrfs_inode_ref.
+ * The iterate callback may return a non-zero value to stop iteration. This
+ * can be a negative value for error codes or 1 to simply stop it.
+ *
+ * path must point to the INODE_REF when called.
+ */
+static int iterate_inode_ref(struct send_ctx *sctx,
+			     struct btrfs_root *root, struct btrfs_path *path,
+			     struct btrfs_key *found_key, int resolve,
+			     iterate_inode_ref_t iterate, void *ctx)
+{
+	struct extent_buffer *eb;
+	struct btrfs_item *item;
+	struct btrfs_inode_ref *iref;
+	struct btrfs_path *tmp_path;
+	struct fs_path *p;
+	u32 cur;
+	u32 len;
+	u32 total;
+	int slot;
+	u32 name_len;
+	char *start;
+	int ret = 0;
+	int num;
+	int index;
+
+	p = fs_path_alloc_reversed(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	tmp_path = alloc_path_for_send();
+	if (!tmp_path) {
+		fs_path_free(sctx, p);
+		return -ENOMEM;
+	}
+
+	eb = path->nodes[0];
+	slot = path->slots[0];
+	item = btrfs_item_nr(eb, slot);
+	iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref);
+	cur = 0;
+	len = 0;
+	total = btrfs_item_size(eb, item);
+
+	num = 0;
+	while (cur < total) {
+		fs_path_reset(p);
+
+		name_len = btrfs_inode_ref_name_len(eb, iref);
+		index = btrfs_inode_ref_index(eb, iref);
+		if (resolve) {
+			start = btrfs_iref_to_path(root, tmp_path, iref, eb,
+						found_key->offset, p->buf,
+						p->buf_len);
+			if (IS_ERR(start)) {
+				ret = PTR_ERR(start);
+				goto out;
+			}
+			if (start < p->buf) {
+				/* overflow, try again with a larger buffer */
+				ret = fs_path_ensure_buf(p,
+						p->buf_len + p->buf - start);
+				if (ret < 0)
+					goto out;
+				start = btrfs_iref_to_path(root, tmp_path, iref,
+						eb, found_key->offset, p->buf,
+						p->buf_len);
+				if (IS_ERR(start)) {
+					ret = PTR_ERR(start);
+					goto out;
+				}
+				BUG_ON(start < p->buf);
+			}
+			p->start = start;
+		} else {
+			ret = fs_path_add_from_extent_buffer(p, eb,
+					(unsigned long)(iref + 1), name_len);
+			if (ret < 0)
+				goto out;
+		}
+
+		len = sizeof(*iref) + name_len;
+		iref = (struct btrfs_inode_ref *)((char *)iref + len);
+		cur += len;
+
+		ret = iterate(num, found_key->offset, index, p, ctx);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+
+		num++;
+	}
+
+out:
+	btrfs_free_path(tmp_path);
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+typedef int (*iterate_dir_item_t)(int num, const char *name, int name_len,
+				  const char *data, int data_len,
+				  u8 type, void *ctx);
+
+/*
+ * Helper function to iterate the entries in ONE btrfs_dir_item.
+ * The iterate callback may return a non-zero value to stop iteration. This
+ * can be a negative value for error codes or 1 to simply stop it.
+ *
+ * path must point to the dir item when called.
+ */
+static int iterate_dir_item(struct send_ctx *sctx,
+			    struct btrfs_root *root, struct btrfs_path *path,
+			    struct btrfs_key *found_key,
+			    iterate_dir_item_t iterate, void *ctx)
+{
+	int ret = 0;
+	struct extent_buffer *eb;
+	struct btrfs_item *item;
+	struct btrfs_dir_item *di;
+	struct btrfs_path *tmp_path = NULL;
+	char *buf = NULL;
+	char *buf2 = NULL;
+	int buf_len;
+	int buf_virtual = 0;
+	u32 name_len;
+	u32 data_len;
+	u32 cur;
+	u32 len;
+	u32 total;
+	int slot;
+	int num;
+	u8 type;
+
+	buf_len = PAGE_SIZE;
+	buf = kmalloc(buf_len, GFP_NOFS);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	tmp_path = alloc_path_for_send();
+	if (!tmp_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	eb = path->nodes[0];
+	slot = path->slots[0];
+	item = btrfs_item_nr(eb, slot);
+	di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item);
+	cur = 0;
+	len = 0;
+	total = btrfs_item_size(eb, item);
+
+	num = 0;
+	while (cur < total) {
+		name_len = btrfs_dir_name_len(eb, di);
+		data_len = btrfs_dir_data_len(eb, di);
+		type = btrfs_dir_type(eb, di);
+
+		if (name_len + data_len > buf_len) {
+			buf_len = PAGE_ALIGN(name_len + data_len);
+			if (buf_virtual) {
+				buf2 = vmalloc(buf_len);
+				if (!buf2) {
+					ret = -ENOMEM;
+					goto out;
+				}
+				vfree(buf);
+			} else {
+				buf2 = krealloc(buf, buf_len, GFP_NOFS);
+				if (!buf2) {
+					buf2 = vmalloc(buf_len);
+					if (!buf2) {
+						ret = -ENOMEM;
+						goto out;
+					}
+					kfree(buf);
+					buf_virtual = 1;
+				}
+			}
+
+			buf = buf2;
+			buf2 = NULL;
+		}
+
+		read_extent_buffer(eb, buf, (unsigned long)(di + 1),
+				name_len + data_len);
+
+		len = sizeof(*di) + name_len + data_len;
+		di = (struct btrfs_dir_item *)((char *)di + len);
+		cur += len;
+
+		ret = iterate(num, buf, name_len, buf + name_len, data_len,
+				type, ctx);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+
+		num++;
+	}
+
+out:
+	btrfs_free_path(tmp_path);
+	if (buf_virtual)
+		vfree(buf);
+	else
+		kfree(buf);
+	return ret;
+}
+
+static int __copy_first_ref(int num, u64 dir, int index,
+			    struct fs_path *p, void *ctx)
+{
+	int ret;
+	struct fs_path *pt = ctx;
+
+	ret = fs_path_copy(pt, p);
+	if (ret < 0)
+		return ret;
+
+	/* we want the first only */
+	return 1;
+}
+
+/*
+ * Retrieve the first path of an inode. If an inode has more than one
+ * ref/hardlink, only the first one is used and the rest are ignored.
+ */
+static int get_inode_path(struct send_ctx *sctx, struct btrfs_root *root,
+			  u64 ino, struct fs_path *path)
+{
+	int ret;
+	struct btrfs_key key, found_key;
+	struct btrfs_path *p;
+
+	p = alloc_path_for_send();
+	if (!p)
+		return -ENOMEM;
+
+	fs_path_reset(path);
+
+	key.objectid = ino;
+	key.type = BTRFS_INODE_REF_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot_for_read(root, &key, p, 1, 0);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		ret = 1;
+		goto out;
+	}
+	btrfs_item_key_to_cpu(p->nodes[0], &found_key, p->slots[0]);
+	if (found_key.objectid != ino ||
+		found_key.type != BTRFS_INODE_REF_KEY) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = iterate_inode_ref(sctx, root, p, &found_key, 1,
+			__copy_first_ref, path);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+out:
+	btrfs_free_path(p);
+	return ret;
+}
+
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
new file mode 100644
index 0000000..a4c23ee
--- /dev/null
+++ b/fs/btrfs/send.h
@@ -0,0 +1,126 @@
+/*
+ * Copyright (C) 2012 Alexander Block.  All rights reserved.
+ * Copyright (C) 2012 STRATO.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+
+#define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
+#define BTRFS_SEND_STREAM_VERSION 1
+
+#define BTRFS_SEND_BUF_SIZE (1024 * 64)
+#define BTRFS_SEND_READ_SIZE (1024 * 48)
+
+enum btrfs_tlv_type {
+	BTRFS_TLV_U8,
+	BTRFS_TLV_U16,
+	BTRFS_TLV_U32,
+	BTRFS_TLV_U64,
+	BTRFS_TLV_BINARY,
+	BTRFS_TLV_STRING,
+	BTRFS_TLV_UUID,
+	BTRFS_TLV_TIMESPEC,
+};
+
+struct btrfs_stream_header {
+	char magic[sizeof(BTRFS_SEND_STREAM_MAGIC)];
+	__le32 version;
+} __attribute__ ((__packed__));
+
+struct btrfs_cmd_header {
+	__le32 len;
+	__le16 cmd;
+	__le32 crc;
+} __attribute__ ((__packed__));
+
+struct btrfs_tlv_header {
+	__le16 tlv_type;
+	__le16 tlv_len;
+} __attribute__ ((__packed__));
+
+/* commands */
+enum btrfs_send_cmd {
+	BTRFS_SEND_C_UNSPEC,
+
+	BTRFS_SEND_C_SUBVOL,
+	BTRFS_SEND_C_SNAPSHOT,
+
+	BTRFS_SEND_C_MKFILE,
+	BTRFS_SEND_C_MKDIR,
+	BTRFS_SEND_C_MKNOD,
+	BTRFS_SEND_C_MKFIFO,
+	BTRFS_SEND_C_MKSOCK,
+	BTRFS_SEND_C_SYMLINK,
+
+	BTRFS_SEND_C_RENAME,
+	BTRFS_SEND_C_LINK,
+	BTRFS_SEND_C_UNLINK,
+	BTRFS_SEND_C_RMDIR,
+
+	BTRFS_SEND_C_SET_XATTR,
+	BTRFS_SEND_C_REMOVE_XATTR,
+
+	BTRFS_SEND_C_WRITE,
+	BTRFS_SEND_C_CLONE,
+
+	BTRFS_SEND_C_TRUNCATE,
+	BTRFS_SEND_C_CHMOD,
+	BTRFS_SEND_C_CHOWN,
+	BTRFS_SEND_C_UTIMES,
+
+	BTRFS_SEND_C_END,
+	__BTRFS_SEND_C_MAX,
+};
+#define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
+
+/* attributes in send stream */
+enum {
+	BTRFS_SEND_A_UNSPEC,
+
+	BTRFS_SEND_A_UUID,
+	BTRFS_SEND_A_CTRANSID,
+
+	BTRFS_SEND_A_INO,
+	BTRFS_SEND_A_SIZE,
+	BTRFS_SEND_A_MODE,
+	BTRFS_SEND_A_UID,
+	BTRFS_SEND_A_GID,
+	BTRFS_SEND_A_RDEV,
+	BTRFS_SEND_A_CTIME,
+	BTRFS_SEND_A_MTIME,
+	BTRFS_SEND_A_ATIME,
+	BTRFS_SEND_A_OTIME,
+
+	BTRFS_SEND_A_XATTR_NAME,
+	BTRFS_SEND_A_XATTR_DATA,
+
+	BTRFS_SEND_A_PATH,
+	BTRFS_SEND_A_PATH_TO,
+	BTRFS_SEND_A_PATH_LINK,
+
+	BTRFS_SEND_A_FILE_OFFSET,
+	BTRFS_SEND_A_DATA,
+
+	BTRFS_SEND_A_CLONE_UUID,
+	BTRFS_SEND_A_CLONE_CTRANSID,
+	BTRFS_SEND_A_CLONE_PATH,
+	BTRFS_SEND_A_CLONE_OFFSET,
+	BTRFS_SEND_A_CLONE_LEN,
+
+	__BTRFS_SEND_A_MAX,
+};
+#define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
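Each command payload is a sequence of btrfs_tlv_header-prefixed attributes, so a receiver walks it header by header, skipping tlv_len bytes of value each time. A minimal userspace sketch of that walk, assuming a little-endian host and this mirrored header layout (a real parser would also decode each value according to its tlv_type):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Host-endian mirror of struct btrfs_tlv_header. */
struct tlv_header {
	uint16_t tlv_type;
	uint16_t tlv_len;	/* length of the value, header excluded */
} __attribute__((__packed__));

/* Hop attribute by attribute through a command payload, as a receiver
 * does. Returns the number of TLVs, or -1 if one is truncated. */
static int count_tlvs(const unsigned char *buf, size_t len)
{
	size_t pos = 0;
	int n = 0;

	while (pos + sizeof(struct tlv_header) <= len) {
		struct tlv_header h;

		memcpy(&h, buf + pos, sizeof(h));
		pos += sizeof(h) + h.tlv_len;
		if (pos > len)
			return -1;	/* value runs past the buffer */
		n++;
	}
	return pos == len ? n : -1;
}
```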
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-04 13:38 [RFC PATCH 0/7] Experimental btrfs send/receive (kernel side) Alexander Block
                   ` (5 preceding siblings ...)
  2012-07-04 13:38 ` [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1) Alexander Block
@ 2012-07-04 13:38 ` Alexander Block
  2012-07-10 15:26   ` Alex Lyakas
                     ` (2 more replies)
  6 siblings, 3 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-04 13:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

This is the second part of the split BTRFS_IOC_SEND patch, which
contains the actual send logic.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
---
 fs/btrfs/ioctl.c |    3 +
 fs/btrfs/send.c  | 3246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/send.h  |    4 +
 3 files changed, 3253 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8d258cb..9173867 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -54,6 +54,7 @@
 #include "inode-map.h"
 #include "backref.h"
 #include "rcu-string.h"
+#include "send.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -3567,6 +3568,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_balance_progress(root, argp);
 	case BTRFS_IOC_SET_RECEIVED_SUBVOL:
 		return btrfs_ioctl_set_received_subvol(file, argp);
+	case BTRFS_IOC_SEND:
+		return btrfs_ioctl_send(file, argp);
 	case BTRFS_IOC_GET_DEV_STATS:
 		return btrfs_ioctl_get_dev_stats(root, argp, 0);
 	case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 47a2557..4d3fcfc 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1007,3 +1007,3249 @@ out:
 	return ret;
 }
 
+struct backref_ctx {
+	struct send_ctx *sctx;
+
+	/* number of total found references */
+	u64 found;
+
+	/*
+	 * Used for clones found in send_root. Clones found at or after
+	 * cur_objectid/cur_offset are not considered allowed clone sources.
+	 */
+	u64 cur_objectid;
+	u64 cur_offset;
+
+	/* may be truncated in case it's the last extent in a file */
+	u64 extent_len;
+
+	/* Just to check for bugs in backref resolving */
+	int found_in_send_root;
+};
+
+static int __clone_root_cmp_bsearch(const void *key, const void *elt)
+{
+	u64 root = (u64)(unsigned long)key;
+	struct clone_root *cr = (struct clone_root *)elt;
+
+	if (root < cr->root->objectid)
+		return -1;
+	if (root > cr->root->objectid)
+		return 1;
+	return 0;
+}
+
+static int __clone_root_cmp_sort(const void *e1, const void *e2)
+{
+	struct clone_root *cr1 = (struct clone_root *)e1;
+	struct clone_root *cr2 = (struct clone_root *)e2;
+
+	if (cr1->root->objectid < cr2->root->objectid)
+		return -1;
+	if (cr1->root->objectid > cr2->root->objectid)
+		return 1;
+	return 0;
+}
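The two comparators above drive a standard sort-then-search pattern: the clone_roots array is sorted once by objectid, then every backref hit looks its root up with bsearch(). Note the asymmetry: the bsearch key is a bare objectid, not an array element. The userspace sketch below simplifies the struct and passes a pointer to the key value (the portable libc idiom) instead of casting the value itself to a pointer as the kernel code does:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for struct clone_root. */
struct clone_root {
	uint64_t objectid;
	int found_refs;
};

/* Key-vs-element comparator for bsearch(): key is a bare objectid. */
static int clone_root_cmp_bsearch(const void *key, const void *elt)
{
	uint64_t root = *(const uint64_t *)key;
	const struct clone_root *cr = elt;

	if (root < cr->objectid)
		return -1;
	if (root > cr->objectid)
		return 1;
	return 0;
}

/* Element-vs-element comparator for qsort(). */
static int clone_root_cmp_sort(const void *e1, const void *e2)
{
	const struct clone_root *cr1 = e1;
	const struct clone_root *cr2 = e2;

	if (cr1->objectid < cr2->objectid)
		return -1;
	if (cr1->objectid > cr2->objectid)
		return 1;
	return 0;
}
```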
+
+/*
+ * Called for every backref that is found for the current extent.
+ */
+static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_)
+{
+	struct backref_ctx *bctx = ctx_;
+	struct clone_root *found;
+	int ret;
+	u64 i_size;
+
+	/* First check if the root is in the list of accepted clone sources */
+	found = bsearch((void *)(unsigned long)root, bctx->sctx->clone_roots,
+			bctx->sctx->clone_roots_cnt,
+			sizeof(struct clone_root),
+			__clone_root_cmp_bsearch);
+	if (!found)
+		return 0;
+
+	if (found->root == bctx->sctx->send_root &&
+	    ino == bctx->cur_objectid &&
+	    offset == bctx->cur_offset) {
+		bctx->found_in_send_root = 1;
+	}
+
+	/*
+	 * There are inodes that have extents lying beyond their i_size. Don't
+	 * accept clones from these extents.
+	 */
+	ret = get_inode_info(found->root, ino, &i_size, NULL, NULL, NULL, NULL);
+	if (ret < 0)
+		return ret;
+
+	if (offset + bctx->extent_len > i_size)
+		return 0;
+
+	/*
+	 * Make sure we don't consider clones from send_root that are
+	 * behind the current inode/offset.
+	 */
+	if (found->root == bctx->sctx->send_root) {
+		/*
+		 * TODO: for the moment we don't accept clones from the inode
+		 * that is currently being sent. We may change this when
+		 * BTRFS_IOC_CLONE_RANGE supports cloning from and to the same
+		 * file.
+		 */
+		if (ino >= bctx->cur_objectid)
+			return 0;
+
+		bctx->found++;
+		found->found_refs++;
+		found->ino = ino;
+		found->offset = offset;
+		return 0;
+	}
+
+	bctx->found++;
+	found->found_refs++;
+	if (ino < found->ino) {
+		found->ino = ino;
+		found->offset = offset;
+	} else if (found->ino == ino) {
+		/*
+		 * Same extent found more than once in the same file.
+		 */
+		if (found->offset > offset + bctx->extent_len)
+			found->offset = offset;
+	}
+
+	return 0;
+}
+
+/*
+ * path must point to the extent item when called.
+ */
+static int find_extent_clone(struct send_ctx *sctx,
+			     struct btrfs_path *path,
+			     u64 ino, u64 data_offset,
+			     u64 ino_size,
+			     struct clone_root **found)
+{
+	int ret;
+	int extent_type;
+	u64 logical;
+	u64 num_bytes;
+	u64 extent_item_pos;
+	struct btrfs_file_extent_item *fi;
+	struct extent_buffer *eb = path->nodes[0];
+	struct backref_ctx backref_ctx;
+	struct clone_root *cur_clone_root;
+	struct btrfs_key found_key;
+	struct btrfs_path *tmp_path;
+	u32 i;
+
+	tmp_path = alloc_path_for_send();
+	if (!tmp_path)
+		return -ENOMEM;
+
+	if (data_offset >= ino_size) {
+		/*
+		 * There may be extents that lie beyond the file's size, e.g.
+		 * created when a snapshot was taken while large files were
+		 * still being written.
+		 */
+		ret = 0;
+		goto out;
+	}
+
+	fi = btrfs_item_ptr(eb, path->slots[0],
+			struct btrfs_file_extent_item);
+	extent_type = btrfs_file_extent_type(eb, fi);
+	if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	num_bytes = btrfs_file_extent_num_bytes(eb, fi);
+	logical = btrfs_file_extent_disk_bytenr(eb, fi);
+	if (logical == 0) {
+		ret = -ENOENT;
+		goto out;
+	}
+	logical += btrfs_file_extent_offset(eb, fi);
+
+	ret = extent_from_logical(sctx->send_root->fs_info,
+			logical, tmp_path, &found_key);
+	btrfs_release_path(tmp_path);
+
+	if (ret < 0)
+		goto out;
+	if (ret & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		ret = -EIO;
+		goto out;
+	}
+
+	/*
+	 * Setup the clone roots.
+	 */
+	for (i = 0; i < sctx->clone_roots_cnt; i++) {
+		cur_clone_root = sctx->clone_roots + i;
+		cur_clone_root->ino = (u64)-1;
+		cur_clone_root->offset = 0;
+		cur_clone_root->found_refs = 0;
+	}
+
+	backref_ctx.sctx = sctx;
+	backref_ctx.found = 0;
+	backref_ctx.cur_objectid = ino;
+	backref_ctx.cur_offset = data_offset;
+	backref_ctx.found_in_send_root = 0;
+	backref_ctx.extent_len = num_bytes;
+
+	/*
+	 * The last extent of a file may be too large due to page alignment.
+	 * We need to adjust extent_len in this case so that the checks in
+	 * __iterate_backrefs work.
+	 */
+	if (data_offset + num_bytes >= ino_size)
+		backref_ctx.extent_len = ino_size - data_offset;
+
+	/*
+	 * Now collect all backrefs.
+	 */
+	extent_item_pos = logical - found_key.objectid;
+	ret = iterate_extent_inodes(sctx->send_root->fs_info,
+					found_key.objectid, extent_item_pos, 1,
+					__iterate_backrefs, &backref_ctx);
+	if (ret < 0)
+		goto out;
+
+	if (!backref_ctx.found_in_send_root) {
+		/* found a bug in backref code? */
+		ret = -EIO;
+		printk(KERN_ERR "btrfs: ERROR did not find backref in "
+				"send_root. inode=%llu, offset=%llu, "
+				"logical=%llu\n",
+				ino, data_offset, logical);
+		goto out;
+	}
+
+verbose_printk(KERN_DEBUG "btrfs: find_extent_clone: data_offset=%llu, "
+		"ino=%llu, num_bytes=%llu, logical=%llu\n",
+		data_offset, ino, num_bytes, logical);
+
+	if (!backref_ctx.found)
+		verbose_printk("btrfs:    no clones found\n");
+
+	cur_clone_root = NULL;
+	for (i = 0; i < sctx->clone_roots_cnt; i++) {
+		if (sctx->clone_roots[i].found_refs) {
+			if (!cur_clone_root)
+				cur_clone_root = sctx->clone_roots + i;
+			else if (sctx->clone_roots[i].root == sctx->send_root)
+				/* prefer clones from send_root over others */
+				cur_clone_root = sctx->clone_roots + i;
+		}
+	}
+
+	if (cur_clone_root) {
+		*found = cur_clone_root;
+		ret = 0;
+	} else {
+		ret = -ENOENT;
+	}
+
+out:
+	btrfs_free_path(tmp_path);
+	return ret;
+}
+
+static int read_symlink(struct send_ctx *sctx,
+			struct btrfs_root *root,
+			u64 ino,
+			struct fs_path *dest)
+{
+	int ret;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *ei;
+	u8 type;
+	u8 compression;
+	unsigned long off;
+	int len;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = 0;
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	BUG_ON(ret);
+
+	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
+			struct btrfs_file_extent_item);
+	type = btrfs_file_extent_type(path->nodes[0], ei);
+	compression = btrfs_file_extent_compression(path->nodes[0], ei);
+	BUG_ON(type != BTRFS_FILE_EXTENT_INLINE);
+	BUG_ON(compression);
+
+	off = btrfs_file_extent_inline_start(ei);
+	len = btrfs_file_extent_inline_len(path->nodes[0], ei);
+
+	ret = fs_path_add_from_extent_buffer(dest, path->nodes[0], off, len);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Helper function to generate a file name that is unique in the root of
+ * send_root and parent_root. This is used to generate names for orphan inodes.
+ */
+static int gen_unique_name(struct send_ctx *sctx,
+			   u64 ino, u64 gen,
+			   struct fs_path *dest)
+{
+	int ret = 0;
+	struct btrfs_path *path;
+	struct btrfs_dir_item *di;
+	char tmp[64];
+	int len;
+	u64 idx = 0;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		len = snprintf(tmp, sizeof(tmp), "o%llu-%llu-%llu",
+				ino, gen, idx);
+		if (len >= sizeof(tmp)) {
+			/* should really not happen */
+			ret = -EOVERFLOW;
+			goto out;
+		}
+
+		di = btrfs_lookup_dir_item(NULL, sctx->send_root,
+				path, BTRFS_FIRST_FREE_OBJECTID,
+				tmp, strlen(tmp), 0);
+		btrfs_release_path(path);
+		if (IS_ERR(di)) {
+			ret = PTR_ERR(di);
+			goto out;
+		}
+		if (di) {
+			/* not unique, try again */
+			idx++;
+			continue;
+		}
+
+		if (!sctx->parent_root) {
+			/* unique */
+			ret = 0;
+			break;
+		}
+
+		di = btrfs_lookup_dir_item(NULL, sctx->parent_root,
+				path, BTRFS_FIRST_FREE_OBJECTID,
+				tmp, strlen(tmp), 0);
+		btrfs_release_path(path);
+		if (IS_ERR(di)) {
+			ret = PTR_ERR(di);
+			goto out;
+		}
+		if (di) {
+			/* not unique, try again */
+			idx++;
+			continue;
+		}
+		/* unique */
+		break;
+	}
+
+	ret = fs_path_add(dest, tmp, strlen(tmp));
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
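Orphan names produced by gen_unique_name() follow the fixed pattern "o&lt;ino&gt;-&lt;gen&gt;-&lt;idx&gt;"; the kernel then probes both send_root and parent_root and bumps idx until the name collides with nothing. A userspace sketch of just the formatting step with its truncation check (the helper name here is made up):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an orphan name the way gen_unique_name() does; returns the
 * name length, or -1 if it would not fit (which the kernel treats as
 * -EOVERFLOW and "should really not happen"). */
static int orphan_name(char *dst, size_t dstlen,
		       uint64_t ino, uint64_t gen, uint64_t idx)
{
	int len = snprintf(dst, dstlen, "o%llu-%llu-%llu",
			   (unsigned long long)ino,
			   (unsigned long long)gen,
			   (unsigned long long)idx);

	if (len < 0 || (size_t)len >= dstlen)
		return -1;
	return len;
}
```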
+
+enum inode_state {
+	inode_state_no_change,
+	inode_state_will_create,
+	inode_state_did_create,
+	inode_state_will_delete,
+	inode_state_did_delete,
+};
+
+static int get_cur_inode_state(struct send_ctx *sctx, u64 ino, u64 gen)
+{
+	int ret;
+	int left_ret;
+	int right_ret;
+	u64 left_gen;
+	u64 right_gen;
+
+	ret = get_inode_info(sctx->send_root, ino, NULL, &left_gen, NULL, NULL,
+			NULL);
+	if (ret < 0 && ret != -ENOENT)
+		goto out;
+	left_ret = ret;
+
+	if (!sctx->parent_root) {
+		right_ret = -ENOENT;
+	} else {
+		ret = get_inode_info(sctx->parent_root, ino, NULL, &right_gen,
+				NULL, NULL, NULL);
+		if (ret < 0 && ret != -ENOENT)
+			goto out;
+		right_ret = ret;
+	}
+
+	if (!left_ret && !right_ret) {
+		if (left_gen == gen && right_gen == gen)
+			ret = inode_state_no_change;
+		else if (left_gen == gen) {
+			if (ino < sctx->send_progress)
+				ret = inode_state_did_create;
+			else
+				ret = inode_state_will_create;
+		} else if (right_gen == gen) {
+			if (ino < sctx->send_progress)
+				ret = inode_state_did_delete;
+			else
+				ret = inode_state_will_delete;
+		} else {
+			ret = -ENOENT;
+		}
+	} else if (!left_ret) {
+		if (left_gen == gen) {
+			if (ino < sctx->send_progress)
+				ret = inode_state_did_create;
+			else
+				ret = inode_state_will_create;
+		} else {
+			ret = -ENOENT;
+		}
+	} else if (!right_ret) {
+		if (right_gen == gen) {
+			if (ino < sctx->send_progress)
+				ret = inode_state_did_delete;
+			else
+				ret = inode_state_will_delete;
+		} else {
+			ret = -ENOENT;
+		}
+	} else {
+		ret = -ENOENT;
+	}
+
+out:
+	return ret;
+}
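The branches of get_cur_inode_state() reduce to a small decision table: an inode "exists" on a side only if the lookup succeeded and the generation matches (a mismatch means the inode number was reused for a different inode), and whether a state is "will" or "did" depends only on ino versus send_progress. A pure userspace model of that table, with made-up enum names:

```c
#include <assert.h>
#include <stdint.h>

enum state {
	STATE_NO_CHANGE,
	STATE_WILL_CREATE,
	STATE_DID_CREATE,
	STATE_WILL_DELETE,
	STATE_DID_DELETE,
	STATE_NOT_FOUND,	/* models the -ENOENT cases */
};

/* left_* describe the inode as looked up in send_root, right_* as
 * looked up in parent_root; *_ok is whether the lookup succeeded. */
static enum state cur_state(int left_ok, uint64_t left_gen,
			    int right_ok, uint64_t right_gen,
			    uint64_t gen, uint64_t ino, uint64_t progress)
{
	int in_new = left_ok && left_gen == gen;
	int in_old = right_ok && right_gen == gen;

	if (in_new && in_old)
		return STATE_NO_CHANGE;
	if (in_new)
		return ino < progress ? STATE_DID_CREATE : STATE_WILL_CREATE;
	if (in_old)
		return ino < progress ? STATE_DID_DELETE : STATE_WILL_DELETE;
	return STATE_NOT_FOUND;
}
```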
+
+static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
+{
+	int ret;
+
+	ret = get_cur_inode_state(sctx, ino, gen);
+	if (ret < 0)
+		goto out;
+
+	if (ret == inode_state_no_change ||
+	    ret == inode_state_did_create ||
+	    ret == inode_state_will_delete)
+		ret = 1;
+	else
+		ret = 0;
+
+out:
+	return ret;
+}
+
+/*
+ * Helper function to lookup a dir item in a dir.
+ */
+static int lookup_dir_item_inode(struct btrfs_root *root,
+				 u64 dir, const char *name, int name_len,
+				 u64 *found_inode,
+				 u8 *found_type)
+{
+	int ret = 0;
+	struct btrfs_dir_item *di;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	di = btrfs_lookup_dir_item(NULL, root, path,
+			dir, name, name_len, 0);
+	if (!di) {
+		ret = -ENOENT;
+		goto out;
+	}
+	if (IS_ERR(di)) {
+		ret = PTR_ERR(di);
+		goto out;
+	}
+	btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key);
+	*found_inode = key.objectid;
+	*found_type = btrfs_dir_type(path->nodes[0], di);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int get_first_ref(struct send_ctx *sctx,
+			 struct btrfs_root *root, u64 ino,
+			 u64 *dir, u64 *dir_gen, struct fs_path *name)
+{
+	int ret;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_path *path;
+	struct btrfs_inode_ref *iref;
+	int len;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = ino;
+	key.type = BTRFS_INODE_REF_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
+	if (ret < 0)
+		goto out;
+	if (!ret)
+		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+				path->slots[0]);
+	if (ret || found_key.objectid != key.objectid ||
+	    found_key.type != key.type) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	iref = btrfs_item_ptr(path->nodes[0], path->slots[0],
+			struct btrfs_inode_ref);
+	len = btrfs_inode_ref_name_len(path->nodes[0], iref);
+	ret = fs_path_add_from_extent_buffer(name, path->nodes[0],
+			(unsigned long)(iref + 1), len);
+	if (ret < 0)
+		goto out;
+	btrfs_release_path(path);
+
+	ret = get_inode_info(root, found_key.offset, NULL, dir_gen, NULL, NULL,
+			NULL);
+	if (ret < 0)
+		goto out;
+
+	*dir = found_key.offset;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int is_first_ref(struct send_ctx *sctx,
+			struct btrfs_root *root,
+			u64 ino, u64 dir,
+			const char *name, int name_len)
+{
+	int ret;
+	struct fs_path *tmp_name;
+	u64 tmp_dir;
+	u64 tmp_dir_gen;
+
+	tmp_name = fs_path_alloc(sctx);
+	if (!tmp_name)
+		return -ENOMEM;
+
+	ret = get_first_ref(sctx, root, ino, &tmp_dir, &tmp_dir_gen, tmp_name);
+	if (ret < 0)
+		goto out;
+
+	if (name_len != fs_path_len(tmp_name)) {
+		ret = 0;
+		goto out;
+	}
+
+	ret = memcmp(tmp_name->start, name, name_len);
+	if (ret)
+		ret = 0;
+	else
+		ret = 1;
+
+out:
+	fs_path_free(sctx, tmp_name);
+	return ret;
+}
+
+static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
+			      const char *name, int name_len,
+			      u64 *who_ino, u64 *who_gen)
+{
+	int ret = 0;
+	u64 other_inode = 0;
+	u8 other_type = 0;
+
+	if (!sctx->parent_root)
+		goto out;
+
+	ret = is_inode_existent(sctx, dir, dir_gen);
+	if (ret <= 0)
+		goto out;
+
+	ret = lookup_dir_item_inode(sctx->parent_root, dir, name, name_len,
+			&other_inode, &other_type);
+	if (ret < 0 && ret != -ENOENT)
+		goto out;
+	if (ret) {
+		ret = 0;
+		goto out;
+	}
+
+	if (other_inode > sctx->send_progress) {
+		ret = get_inode_info(sctx->parent_root, other_inode, NULL,
+				who_gen, NULL, NULL, NULL);
+		if (ret < 0)
+			goto out;
+
+		ret = 1;
+		*who_ino = other_inode;
+	} else {
+		ret = 0;
+	}
+
+out:
+	return ret;
+}
+
+static int did_overwrite_ref(struct send_ctx *sctx,
+			    u64 dir, u64 dir_gen,
+			    u64 ino, u64 ino_gen,
+			    const char *name, int name_len)
+{
+	int ret = 0;
+	u64 gen;
+	u64 ow_inode;
+	u8 other_type;
+
+	if (!sctx->parent_root)
+		goto out;
+
+	ret = is_inode_existent(sctx, dir, dir_gen);
+	if (ret <= 0)
+		goto out;
+
+	/* check if the ref was overwritten by another ref */
+	ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
+			&ow_inode, &other_type);
+	if (ret < 0 && ret != -ENOENT)
+		goto out;
+	if (ret) {
+		/* was never and will never be overwritten */
+		ret = 0;
+		goto out;
+	}
+
+	ret = get_inode_info(sctx->send_root, ow_inode, NULL, &gen, NULL, NULL,
+			NULL);
+	if (ret < 0)
+		goto out;
+
+	if (ow_inode == ino && gen == ino_gen) {
+		ret = 0;
+		goto out;
+	}
+
+	/* we know that it is or will be overwritten. check this now */
+	if (ow_inode < sctx->send_progress)
+		ret = 1;
+	else
+		ret = 0;
+
+out:
+	return ret;
+}
+
+static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen)
+{
+	int ret = 0;
+	struct fs_path *name = NULL;
+	u64 dir;
+	u64 dir_gen;
+
+	if (!sctx->parent_root)
+		goto out;
+
+	name = fs_path_alloc(sctx);
+	if (!name)
+		return -ENOMEM;
+
+	ret = get_first_ref(sctx, sctx->parent_root, ino, &dir, &dir_gen, name);
+	if (ret < 0)
+		goto out;
+
+	ret = did_overwrite_ref(sctx, dir, dir_gen, ino, gen,
+			name->start, fs_path_len(name));
+	if (ret < 0)
+		goto out;
+
+out:
+	fs_path_free(sctx, name);
+	return ret;
+}
+
+static int name_cache_insert(struct send_ctx *sctx,
+			     struct name_cache_entry *nce)
+{
+	int ret = 0;
+	struct name_cache_entry **ncea;
+
+	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);
+	if (ncea) {
+		if (!ncea[0])
+			ncea[0] = nce;
+		else if (!ncea[1])
+			ncea[1] = nce;
+		else
+			BUG();
+	} else {
+		ncea = kmalloc(sizeof(void *) * 2, GFP_NOFS);
+		if (!ncea)
+			return -ENOMEM;
+
+		ncea[0] = nce;
+		ncea[1] = NULL;
+		ret = radix_tree_insert(&sctx->name_cache, nce->ino, ncea);
+		if (ret < 0)
+			return ret;
+	}
+	list_add_tail(&nce->list, &sctx->name_cache_list);
+	sctx->name_cache_size++;
+
+	return ret;
+}
+
+static void name_cache_delete(struct send_ctx *sctx,
+			      struct name_cache_entry *nce)
+{
+	struct name_cache_entry **ncea;
+
+	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);
+	BUG_ON(!ncea);
+
+	if (ncea[0] == nce)
+		ncea[0] = NULL;
+	else if (ncea[1] == nce)
+		ncea[1] = NULL;
+	else
+		BUG();
+
+	if (!ncea[0] && !ncea[1]) {
+		radix_tree_delete(&sctx->name_cache, nce->ino);
+		kfree(ncea);
+	}
+
+	list_del(&nce->list);
+
+	sctx->name_cache_size--;
+}
+
+static struct name_cache_entry *name_cache_search(struct send_ctx *sctx,
+						    u64 ino, u64 gen)
+{
+	struct name_cache_entry **ncea;
+
+	ncea = radix_tree_lookup(&sctx->name_cache, ino);
+	if (!ncea)
+		return NULL;
+
+	if (ncea[0] && ncea[0]->gen == gen)
+		return ncea[0];
+	else if (ncea[1] && ncea[1]->gen == gen)
+		return ncea[1];
+	return NULL;
+}
+
+static void name_cache_used(struct send_ctx *sctx, struct name_cache_entry *nce)
+{
+	list_del(&nce->list);
+	list_add_tail(&nce->list, &sctx->name_cache_list);
+}
+
+static void name_cache_clean_unused(struct send_ctx *sctx)
+{
+	struct name_cache_entry *nce;
+
+	if (sctx->name_cache_size < SEND_CTX_NAME_CACHE_CLEAN_SIZE)
+		return;
+
+	while (sctx->name_cache_size > SEND_CTX_MAX_NAME_CACHE_SIZE) {
+		nce = list_entry(sctx->name_cache_list.next,
+				struct name_cache_entry, list);
+		name_cache_delete(sctx, nce);
+		kfree(nce);
+	}
+}
+
+static void name_cache_free(struct send_ctx *sctx)
+{
+	struct name_cache_entry *nce;
+	struct name_cache_entry *tmp;
+
+	list_for_each_entry_safe(nce, tmp, &sctx->name_cache_list, list) {
+		name_cache_delete(sctx, nce);
+	}
+}
+
+static int __get_cur_name_and_parent(struct send_ctx *sctx,
+				     u64 ino, u64 gen,
+				     u64 *parent_ino,
+				     u64 *parent_gen,
+				     struct fs_path *dest)
+{
+	int ret;
+	int nce_ret;
+	struct btrfs_path *path = NULL;
+	struct name_cache_entry *nce = NULL;
+
+	nce = name_cache_search(sctx, ino, gen);
+	if (nce) {
+		if (ino < sctx->send_progress && nce->need_later_update) {
+			name_cache_delete(sctx, nce);
+			kfree(nce);
+			nce = NULL;
+		} else {
+			name_cache_used(sctx, nce);
+			*parent_ino = nce->parent_ino;
+			*parent_gen = nce->parent_gen;
+			ret = fs_path_add(dest, nce->name, nce->name_len);
+			if (ret < 0)
+				goto out;
+			ret = nce->ret;
+			goto out;
+		}
+	}
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	ret = is_inode_existent(sctx, ino, gen);
+	if (ret < 0)
+		goto out;
+
+	if (!ret) {
+		ret = gen_unique_name(sctx, ino, gen, dest);
+		if (ret < 0)
+			goto out;
+		ret = 1;
+		goto out_cache;
+	}
+
+	if (ino < sctx->send_progress)
+		ret = get_first_ref(sctx, sctx->send_root, ino,
+				parent_ino, parent_gen, dest);
+	else
+		ret = get_first_ref(sctx, sctx->parent_root, ino,
+				parent_ino, parent_gen, dest);
+	if (ret < 0)
+		goto out;
+
+	ret = did_overwrite_ref(sctx, *parent_ino, *parent_gen, ino, gen,
+			dest->start, dest->end - dest->start);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		fs_path_reset(dest);
+		ret = gen_unique_name(sctx, ino, gen, dest);
+		if (ret < 0)
+			goto out;
+		ret = 1;
+	}
+
+out_cache:
+	nce = kmalloc(sizeof(*nce) + fs_path_len(dest) + 1, GFP_NOFS);
+	if (!nce) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	nce->ino = ino;
+	nce->gen = gen;
+	nce->parent_ino = *parent_ino;
+	nce->parent_gen = *parent_gen;
+	nce->name_len = fs_path_len(dest);
+	nce->ret = ret;
+	strcpy(nce->name, dest->start);
+	memset(&nce->use_list, 0, sizeof(nce->use_list));
+
+	if (ino < sctx->send_progress)
+		nce->need_later_update = 0;
+	else
+		nce->need_later_update = 1;
+
+	nce_ret = name_cache_insert(sctx, nce);
+	if (nce_ret < 0)
+		ret = nce_ret;
+	name_cache_clean_unused(sctx);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Magic happens here. This function returns the first ref to an inode as it
+ * would look while receiving the stream at this point in time.
+ * We walk the path up to the root. For every inode in between, we check if it
+ * was already processed/sent. If yes, we continue with the parent as found
+ * in send_root. If not, we continue with the parent as found in parent_root.
+ * If we encounter an inode that was deleted at this point in time, we use the
+ * inode's "orphan" name instead of the real name and stop. Same with new
+ * inodes that were not created yet and with overwritten inodes/refs.
+ *
+ * When do we have orphan inodes:
+ * 1. When an inode is freshly created and thus no valid refs are available yet
+ * 2. When a directory lost all its refs (deleted) but still has dir items
+ *    inside which were not processed yet (pending for move/delete). If anyone
+ *    tried to get the path to the dir items, it would get a path inside that
+ *    orphan directory.
+ * 3. When an inode is moved around or gets new links, it may overwrite the ref
+ *    of an unprocessed inode. If in that case the first ref would be
+ *    overwritten, the overwritten inode gets "orphanized". Later, when we
+ *    process this overwritten inode, it is restored at a new place by moving
+ *    the orphan inode.
+ *
+ * sctx->send_progress tells this function at which point in time receiving
+ * would be.
+ */
+static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
+			struct fs_path *dest)
+{
+	int ret = 0;
+	struct fs_path *name = NULL;
+	u64 parent_inode = 0;
+	u64 parent_gen = 0;
+	int stop = 0;
+
+	name = fs_path_alloc(sctx);
+	if (!name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	dest->reversed = 1;
+	fs_path_reset(dest);
+
+	while (!stop && ino != BTRFS_FIRST_FREE_OBJECTID) {
+		fs_path_reset(name);
+
+		ret = __get_cur_name_and_parent(sctx, ino, gen,
+				&parent_inode, &parent_gen, name);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			stop = 1;
+
+		ret = fs_path_add_path(dest, name);
+		if (ret < 0)
+			goto out;
+
+		ino = parent_inode;
+		gen = parent_gen;
+	}
+
+out:
+	fs_path_free(sctx, name);
+	if (!ret)
+		fs_path_unreverse(dest);
+	return ret;
+}
+
+/*
+ * Called for regular files when sending extent data. Opens a struct file
+ * to read from the file.
+ */
+static int open_cur_inode_file(struct send_ctx *sctx)
+{
+	int ret = 0;
+	struct btrfs_key key;
+	struct vfsmount *mnt;
+	struct inode *inode;
+	struct dentry *dentry;
+	struct file *filp;
+	int new = 0;
+
+	if (sctx->cur_inode_filp)
+		goto out;
+
+	key.objectid = sctx->cur_ino;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+
+	inode = btrfs_iget(sctx->send_root->fs_info->sb, &key, sctx->send_root,
+			&new);
+	if (IS_ERR(inode)) {
+		ret = PTR_ERR(inode);
+		goto out;
+	}
+
+	dentry = d_obtain_alias(inode);
+	inode = NULL;
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		goto out;
+	}
+
+	mnt = mntget(sctx->mnt);
+	filp = dentry_open(dentry, mnt, O_RDONLY | O_LARGEFILE, current_cred());
+	dentry = NULL;
+	mnt = NULL;
+	if (IS_ERR(filp)) {
+		ret = PTR_ERR(filp);
+		goto out;
+	}
+	sctx->cur_inode_filp = filp;
+
+out:
+	/*
+	 * No explicit put is required here, as every vfs op above cleans
+	 * up by itself on failure.
+	 */
+	return ret;
+}
+
+/*
+ * Closes the struct file that was created in open_cur_inode_file
+ */
+static int close_cur_inode_file(struct send_ctx *sctx)
+{
+	int ret = 0;
+
+	if (!sctx->cur_inode_filp)
+		goto out;
+
+	ret = filp_close(sctx->cur_inode_filp, NULL);
+	sctx->cur_inode_filp = NULL;
+
+out:
+	return ret;
+}
+
+/*
+ * Sends a BTRFS_SEND_C_SUBVOL command/item to userspace
+ */
+static int send_subvol_begin(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_root *send_root = sctx->send_root;
+	struct btrfs_root *parent_root = sctx->parent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_root_ref *ref;
+	struct extent_buffer *leaf;
+	char *name = NULL;
+	int namelen;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	name = kmalloc(BTRFS_PATH_NAME_MAX, GFP_NOFS);
+	if (!name) {
+		btrfs_free_path(path);
+		return -ENOMEM;
+	}
+
+	key.objectid = send_root->objectid;
+	key.type = BTRFS_ROOT_BACKREF_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot_for_read(send_root->fs_info->tree_root,
+				&key, path, 1, 0);
+	if (ret < 0)
+		goto out;
+	if (ret) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (key.type != BTRFS_ROOT_BACKREF_KEY ||
+	    key.objectid != send_root->objectid) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
+	namelen = btrfs_root_ref_name_len(leaf, ref);
+	read_extent_buffer(leaf, name, (unsigned long)(ref + 1), namelen);
+	btrfs_release_path(path);
+
+	if (parent_root) {
+		ret = begin_cmd(sctx, BTRFS_SEND_C_SNAPSHOT);
+		if (ret < 0)
+			goto out;
+	} else {
+		ret = begin_cmd(sctx, BTRFS_SEND_C_SUBVOL);
+		if (ret < 0)
+			goto out;
+	}
+
+	TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen);
+	TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
+			sctx->send_root->root_item.uuid);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID,
+			sctx->send_root->root_item.ctransid);
+	if (parent_root) {
+		TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
+				sctx->parent_root->root_item.uuid);
+		TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
+				sctx->parent_root->root_item.ctransid);
+	}
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	btrfs_free_path(path);
+	kfree(name);
+	return ret;
+}
+
+static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
+{
+	int ret = 0;
+	struct fs_path *p;
+
+verbose_printk("btrfs: send_truncate %llu size=%llu\n", ino, size);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_TRUNCATE);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, ino, gen, p);
+	if (ret < 0)
+		goto out;
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, size);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int send_chmod(struct send_ctx *sctx, u64 ino, u64 gen, u64 mode)
+{
+	int ret = 0;
+	struct fs_path *p;
+
+verbose_printk("btrfs: send_chmod %llu mode=%llu\n", ino, mode);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_CHMOD);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, ino, gen, p);
+	if (ret < 0)
+		goto out;
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode & 07777);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int send_chown(struct send_ctx *sctx, u64 ino, u64 gen, u64 uid, u64 gid)
+{
+	int ret = 0;
+	struct fs_path *p;
+
+verbose_printk("btrfs: send_chown %llu uid=%llu, gid=%llu\n", ino, uid, gid);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_CHOWN);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, ino, gen, p);
+	if (ret < 0)
+		goto out;
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_UID, uid);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_GID, gid);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int send_utimes(struct send_ctx *sctx, u64 ino, u64 gen)
+{
+	int ret = 0;
+	struct fs_path *p = NULL;
+	struct btrfs_inode_item *ii;
+	struct btrfs_path *path = NULL;
+	struct extent_buffer *eb;
+	struct btrfs_key key;
+	int slot;
+
+verbose_printk("btrfs: send_utimes %llu\n", ino);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	path = alloc_path_for_send();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	key.objectid = ino;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+	ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	eb = path->nodes[0];
+	slot = path->slots[0];
+	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_UTIMES);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, ino, gen, p);
+	if (ret < 0)
+		goto out;
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_ATIME, eb,
+			btrfs_inode_atime(ii));
+	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_MTIME, eb,
+			btrfs_inode_mtime(ii));
+	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_CTIME, eb,
+			btrfs_inode_ctime(ii));
+	/* TODO otime? */
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Sends a BTRFS_SEND_C_MKXXX or SYMLINK command to user space. We don't have
+ * a valid path yet because we did not process the refs yet. So the inode
+ * is created as an orphan.
+ */
+static int send_create_inode(struct send_ctx *sctx, struct btrfs_path *path,
+			     struct btrfs_key *key)
+{
+	int ret = 0;
+	struct extent_buffer *eb = path->nodes[0];
+	struct btrfs_inode_item *ii;
+	struct fs_path *p;
+	int slot = path->slots[0];
+	int cmd;
+	u64 mode;
+
+verbose_printk("btrfs: send_create_inode %llu\n", sctx->cur_ino);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
+	mode = btrfs_inode_mode(eb, ii);
+
+	if (S_ISREG(mode))
+		cmd = BTRFS_SEND_C_MKFILE;
+	else if (S_ISDIR(mode))
+		cmd = BTRFS_SEND_C_MKDIR;
+	else if (S_ISLNK(mode))
+		cmd = BTRFS_SEND_C_SYMLINK;
+	else if (S_ISCHR(mode) || S_ISBLK(mode))
+		cmd = BTRFS_SEND_C_MKNOD;
+	else if (S_ISFIFO(mode))
+		cmd = BTRFS_SEND_C_MKFIFO;
+	else if (S_ISSOCK(mode))
+		cmd = BTRFS_SEND_C_MKSOCK;
+	else {
+		printk(KERN_WARNING "btrfs: unexpected inode type %o\n",
+				(int)(mode & S_IFMT));
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	ret = begin_cmd(sctx, cmd);
+	if (ret < 0)
+		goto out;
+
+	ret = gen_unique_name(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+
+	if (S_ISLNK(mode)) {
+		fs_path_reset(p);
+		ret = read_symlink(sctx, sctx->send_root, sctx->cur_ino, p);
+		if (ret < 0)
+			goto out;
+		TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, p);
+	} else if (S_ISCHR(mode) || S_ISBLK(mode) ||
+		   S_ISFIFO(mode) || S_ISSOCK(mode)) {
+		TLV_PUT_U64(sctx, BTRFS_SEND_A_RDEV, btrfs_inode_rdev(eb, ii));
+	}
+
+	ret = send_cmd(sctx);
+	if (ret < 0)
+		goto out;
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+struct recorded_ref {
+	struct list_head list;
+	char *dir_path;
+	char *name;
+	struct fs_path *full_path;
+	u64 dir;
+	u64 dir_gen;
+	int dir_path_len;
+	int name_len;
+};
+
+/*
+ * We need to process new refs before deleted refs, but compare_tree gives us
+ * everything mixed. So we first record all refs and later process them.
+ * This function is a helper to record one ref.
+ */
+static int record_ref(struct list_head *head, u64 dir,
+		      u64 dir_gen, struct fs_path *path)
+{
+	struct recorded_ref *ref;
+	char *tmp;
+
+	ref = kmalloc(sizeof(*ref), GFP_NOFS);
+	if (!ref)
+		return -ENOMEM;
+
+	ref->dir = dir;
+	ref->dir_gen = dir_gen;
+	ref->full_path = path;
+
+	tmp = strrchr(ref->full_path->start, '/');
+	if (!tmp) {
+		ref->name_len = ref->full_path->end - ref->full_path->start;
+		ref->name = ref->full_path->start;
+		ref->dir_path_len = 0;
+		ref->dir_path = ref->full_path->start;
+	} else {
+		tmp++;
+		ref->name_len = ref->full_path->end - tmp;
+		ref->name = tmp;
+		ref->dir_path = ref->full_path->start;
+		ref->dir_path_len = ref->full_path->end -
+				ref->full_path->start - 1 - ref->name_len;
+	}
+
+	list_add_tail(&ref->list, head);
+	return 0;
+}
+
+static void __free_recorded_refs(struct send_ctx *sctx, struct list_head *head)
+{
+	struct recorded_ref *cur;
+	struct recorded_ref *tmp;
+
+	list_for_each_entry_safe(cur, tmp, head, list) {
+		fs_path_free(sctx, cur->full_path);
+		kfree(cur);
+	}
+	INIT_LIST_HEAD(head);
+}
+
+static void free_recorded_refs(struct send_ctx *sctx)
+{
+	__free_recorded_refs(sctx, &sctx->new_refs);
+	__free_recorded_refs(sctx, &sctx->deleted_refs);
+}
+
+/*
+ * Renames/moves a file/dir to its orphan name. Used when the first
+ * ref of an unprocessed inode gets overwritten and for all non-empty
+ * directories.
+ */
+static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen,
+			  struct fs_path *path)
+{
+	int ret;
+	struct fs_path *orphan;
+
+	orphan = fs_path_alloc(sctx);
+	if (!orphan)
+		return -ENOMEM;
+
+	ret = gen_unique_name(sctx, ino, gen, orphan);
+	if (ret < 0)
+		goto out;
+
+	ret = send_rename(sctx, path, orphan);
+
+out:
+	fs_path_free(sctx, orphan);
+	return ret;
+}
+
+/*
+ * Returns 1 if a directory can be removed at this point in time.
+ * We check this by iterating all dir items and checking if the inode behind
+ * the dir item was already processed.
+ */
+static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 send_progress)
+{
+	int ret = 0;
+	struct btrfs_root *root = sctx->parent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_key loc;
+	struct btrfs_dir_item *di;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = dir;
+	key.type = BTRFS_DIR_INDEX_KEY;
+	key.offset = 0;
+
+	while (1) {
+		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
+		if (ret < 0)
+			goto out;
+		if (!ret) {
+			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+					path->slots[0]);
+		}
+		if (ret || found_key.objectid != key.objectid ||
+		    found_key.type != key.type) {
+			break;
+		}
+
+		di = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dir_item);
+		btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc);
+
+		if (loc.objectid > send_progress) {
+			ret = 0;
+			goto out;
+		}
+
+		btrfs_release_path(path);
+		key.offset = found_key.offset + 1;
+	}
+
+	ret = 1;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * This does all the move/link/unlink/rmdir magic.
+ */
+static int process_recorded_refs(struct send_ctx *sctx)
+{
+	int ret = 0;
+	struct recorded_ref *cur;
+	struct ulist *check_dirs = NULL;
+	struct ulist_iterator uit;
+	struct ulist_node *un;
+	struct fs_path *valid_path = NULL;
+	u64 ow_inode;
+	u64 ow_gen;
+	int did_overwrite = 0;
+	int is_orphan = 0;
+
+verbose_printk("btrfs: process_recorded_refs %llu\n", sctx->cur_ino);
+
+	valid_path = fs_path_alloc(sctx);
+	if (!valid_path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	check_dirs = ulist_alloc(GFP_NOFS);
+	if (!check_dirs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * First, check if the first ref of the current inode was overwritten
+	 * before. If yes, we know that the current inode was already orphanized
+	 * and thus use the orphan name. If not, we can use get_cur_path to
+	 * get the path of the first ref as it would look while receiving at
+	 * this point in time.
+	 * New inodes are always orphans at the beginning, so force the use of
+	 * the orphan name in this case.
+	 * The first ref is stored in valid_path and will be updated if it
+	 * gets moved around.
+	 */
+	if (!sctx->cur_inode_new) {
+		ret = did_overwrite_first_ref(sctx, sctx->cur_ino,
+				sctx->cur_inode_gen);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			did_overwrite = 1;
+	}
+	if (sctx->cur_inode_new || did_overwrite) {
+		ret = gen_unique_name(sctx, sctx->cur_ino,
+				sctx->cur_inode_gen, valid_path);
+		if (ret < 0)
+			goto out;
+		is_orphan = 1;
+	} else {
+		ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+				valid_path);
+		if (ret < 0)
+			goto out;
+	}
+
+	list_for_each_entry(cur, &sctx->new_refs, list) {
+		/*
+		 * Check if this new ref would overwrite the first ref of
+		 * another unprocessed inode. If yes, orphanize the
+		 * overwritten inode. If we find an overwritten ref that is
+		 * not the first ref, simply unlink it.
+		 */
+		ret = will_overwrite_ref(sctx, cur->dir, cur->dir_gen,
+				cur->name, cur->name_len,
+				&ow_inode, &ow_gen);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = is_first_ref(sctx, sctx->parent_root,
+					ow_inode, cur->dir, cur->name,
+					cur->name_len);
+			if (ret < 0)
+				goto out;
+			if (ret) {
+				ret = orphanize_inode(sctx, ow_inode, ow_gen,
+						cur->full_path);
+				if (ret < 0)
+					goto out;
+			} else {
+				ret = send_unlink(sctx, cur->full_path);
+				if (ret < 0)
+					goto out;
+			}
+		}
+
+		/*
+		 * link/move the ref to the new place. If we have an orphan
+		 * inode, move it and update valid_path. If not, link or move
+		 * it depending on the inode mode.
+		 */
+		if (is_orphan) {
+			ret = send_rename(sctx, valid_path, cur->full_path);
+			if (ret < 0)
+				goto out;
+			is_orphan = 0;
+			ret = fs_path_copy(valid_path, cur->full_path);
+			if (ret < 0)
+				goto out;
+		} else {
+			if (S_ISDIR(sctx->cur_inode_mode)) {
+				/*
+				 * Dirs can't be linked, so move it. For moved
+				 * dirs, we always have one new and one deleted
+				 * ref. The deleted ref is ignored later.
+				 */
+				ret = send_rename(sctx, valid_path,
+						cur->full_path);
+				if (ret < 0)
+					goto out;
+				ret = fs_path_copy(valid_path, cur->full_path);
+				if (ret < 0)
+					goto out;
+			} else {
+				ret = send_link(sctx, valid_path,
+						cur->full_path);
+				if (ret < 0)
+					goto out;
+			}
+		}
+		ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
+				GFP_NOFS);
+		if (ret < 0)
+			goto out;
+	}
+
+	if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_deleted) {
+		/*
+		 * Check if we can already rmdir the directory. If not,
+		 * orphanize it. For every dir item inside that gets deleted
+		 * later, we do this check again and rmdir it then if possible.
+		 * See the use of check_dirs for more details.
+		 */
+		ret = can_rmdir(sctx, sctx->cur_ino, sctx->cur_ino);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = send_rmdir(sctx, valid_path);
+			if (ret < 0)
+				goto out;
+		} else if (!is_orphan) {
+			ret = orphanize_inode(sctx, sctx->cur_ino,
+					sctx->cur_inode_gen, valid_path);
+			if (ret < 0)
+				goto out;
+			is_orphan = 1;
+		}
+
+		list_for_each_entry(cur, &sctx->deleted_refs, list) {
+			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
+					GFP_NOFS);
+			if (ret < 0)
+				goto out;
+		}
+	} else if (!S_ISDIR(sctx->cur_inode_mode)) {
+		/*
+		 * We have a non-dir inode. Go through all deleted refs and
+		 * unlink them if they were not already overwritten by other
+		 * inodes.
+		 */
+		list_for_each_entry(cur, &sctx->deleted_refs, list) {
+			ret = did_overwrite_ref(sctx, cur->dir, cur->dir_gen,
+					sctx->cur_ino, sctx->cur_inode_gen,
+					cur->name, cur->name_len);
+			if (ret < 0)
+				goto out;
+			if (!ret) {
+				ret = send_unlink(sctx, cur->full_path);
+				if (ret < 0)
+					goto out;
+			}
+			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
+					GFP_NOFS);
+			if (ret < 0)
+				goto out;
+		}
+
+		/*
+		 * If the inode is still orphan, unlink the orphan. This may
+		 * happen when a previous inode did overwrite the first ref
+		 * of this inode and no new refs were added for the current
+		 * inode.
+		 */
+		if (is_orphan) {
+			ret = send_unlink(sctx, valid_path);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+	/*
+	 * We have collected all parent dirs where cur_inode was once located.
+	 * Now go through all these dirs and check if they are pending for
+	 * deletion and if it's finally possible to perform the rmdir now.
+	 * We also update the utimes of the parent dirs here.
+	 */
+	ULIST_ITER_INIT(&uit);
+	while ((un = ulist_next(check_dirs, &uit))) {
+		if (un->val > sctx->cur_ino)
+			continue;
+
+		ret = get_cur_inode_state(sctx, un->val, un->aux);
+		if (ret < 0)
+			goto out;
+
+		if (ret == inode_state_did_create ||
+		    ret == inode_state_no_change) {
+			/* TODO delayed utimes */
+			ret = send_utimes(sctx, un->val, un->aux);
+			if (ret < 0)
+				goto out;
+		} else if (ret == inode_state_did_delete) {
+			ret = can_rmdir(sctx, un->val, sctx->cur_ino);
+			if (ret < 0)
+				goto out;
+			if (ret) {
+				ret = get_cur_path(sctx, un->val, un->aux,
+						valid_path);
+				if (ret < 0)
+					goto out;
+				ret = send_rmdir(sctx, valid_path);
+				if (ret < 0)
+					goto out;
+			}
+		}
+	}
+
+	/*
+	 * Current inode is now at its new position, so we must increase
+	 * send_progress.
+	 */
+	sctx->send_progress = sctx->cur_ino + 1;
+
+	ret = 0;
+
+out:
+	free_recorded_refs(sctx);
+	ulist_free(check_dirs);
+	fs_path_free(sctx, valid_path);
+	return ret;
+}
+
+static int __record_new_ref(int num, u64 dir, int index,
+			    struct fs_path *name,
+			    void *ctx)
+{
+	int ret = 0;
+	struct send_ctx *sctx = ctx;
+	struct fs_path *p;
+	u64 gen;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL, NULL,
+			NULL);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, dir, gen, p);
+	if (ret < 0)
+		goto out;
+	ret = fs_path_add_path(p, name);
+	if (ret < 0)
+		goto out;
+
+	ret = record_ref(&sctx->new_refs, dir, gen, p);
+
+out:
+	if (ret)
+		fs_path_free(sctx, p);
+	return ret;
+}
+
+static int __record_deleted_ref(int num, u64 dir, int index,
+				struct fs_path *name,
+				void *ctx)
+{
+	int ret = 0;
+	struct send_ctx *sctx = ctx;
+	struct fs_path *p;
+	u64 gen;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL, NULL,
+			NULL);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, dir, gen, p);
+	if (ret < 0)
+		goto out;
+	ret = fs_path_add_path(p, name);
+	if (ret < 0)
+		goto out;
+
+	ret = record_ref(&sctx->deleted_refs, dir, gen, p);
+
+out:
+	if (ret)
+		fs_path_free(sctx, p);
+	return ret;
+}
+
+static int record_new_ref(struct send_ctx *sctx)
+{
+	int ret;
+
+	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
+			sctx->cmp_key, 0, __record_new_ref, sctx);
+
+	return ret;
+}
+
+static int record_deleted_ref(struct send_ctx *sctx)
+{
+	int ret;
+
+	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, 0, __record_deleted_ref, sctx);
+	return ret;
+}
+
+struct find_ref_ctx {
+	u64 dir;
+	struct fs_path *name;
+	int found_idx;
+};
+
+static int __find_iref(int num, u64 dir, int index,
+		       struct fs_path *name,
+		       void *ctx_)
+{
+	struct find_ref_ctx *ctx = ctx_;
+
+	if (dir == ctx->dir && fs_path_len(name) == fs_path_len(ctx->name) &&
+	    strncmp(name->start, ctx->name->start, fs_path_len(name)) == 0) {
+		ctx->found_idx = num;
+		return 1;
+	}
+	return 0;
+}
+
+static int find_iref(struct send_ctx *sctx,
+		     struct btrfs_root *root,
+		     struct btrfs_path *path,
+		     struct btrfs_key *key,
+		     u64 dir, struct fs_path *name)
+{
+	int ret;
+	struct find_ref_ctx ctx;
+
+	ctx.dir = dir;
+	ctx.name = name;
+	ctx.found_idx = -1;
+
+	ret = iterate_inode_ref(sctx, root, path, key, 0, __find_iref, &ctx);
+	if (ret < 0)
+		return ret;
+
+	if (ctx.found_idx == -1)
+		return -ENOENT;
+
+	return ctx.found_idx;
+}
+
+static int __record_changed_new_ref(int num, u64 dir, int index,
+				    struct fs_path *name,
+				    void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+
+	ret = find_iref(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, dir, name);
+	if (ret == -ENOENT)
+		ret = __record_new_ref(num, dir, index, name, sctx);
+	else if (ret > 0)
+		ret = 0;
+
+	return ret;
+}
+
+static int __record_changed_deleted_ref(int num, u64 dir, int index,
+					struct fs_path *name,
+					void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+
+	ret = find_iref(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
+			dir, name);
+	if (ret == -ENOENT)
+		ret = __record_deleted_ref(num, dir, index, name, sctx);
+	else if (ret > 0)
+		ret = 0;
+
+	return ret;
+}
+
+static int record_changed_ref(struct send_ctx *sctx)
+{
+	int ret = 0;
+
+	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
+			sctx->cmp_key, 0, __record_changed_new_ref, sctx);
+	if (ret < 0)
+		goto out;
+	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, 0, __record_changed_deleted_ref, sctx);
+
+out:
+	return ret;
+}
+
+/*
+ * Record and process all refs at once. Needed when an inode changes its
+ * generation number, which means that it was deleted and recreated.
+ */
+static int process_all_refs(struct send_ctx *sctx,
+			    enum btrfs_compare_tree_result cmd)
+{
+	int ret;
+	struct btrfs_root *root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct extent_buffer *eb;
+	int slot;
+	iterate_inode_ref_t cb;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	if (cmd == BTRFS_COMPARE_TREE_NEW) {
+		root = sctx->send_root;
+		cb = __record_new_ref;
+	} else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
+		root = sctx->parent_root;
+		cb = __record_deleted_ref;
+	} else {
+		BUG();
+	}
+
+	key.objectid = sctx->cmp_key->objectid;
+	key.type = BTRFS_INODE_REF_KEY;
+	key.offset = 0;
+	while (1) {
+		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
+		if (ret < 0) {
+			btrfs_release_path(path);
+			goto out;
+		}
+		if (ret) {
+			btrfs_release_path(path);
+			break;
+		}
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &found_key, slot);
+
+		if (found_key.objectid != key.objectid ||
+		    found_key.type != key.type) {
+			btrfs_release_path(path);
+			break;
+		}
+
+		ret = iterate_inode_ref(sctx, root, path,
+				&found_key, 0, cb, sctx);
+		btrfs_release_path(path);
+		if (ret < 0)
+			goto out;
+
+		key.offset = found_key.offset + 1;
+	}
+
+	ret = process_recorded_refs(sctx);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int send_set_xattr(struct send_ctx *sctx,
+			  struct fs_path *path,
+			  const char *name, int name_len,
+			  const char *data, int data_len)
+{
+	int ret = 0;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_SET_XATTR);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
+	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
+	TLV_PUT(sctx, BTRFS_SEND_A_XATTR_DATA, data, data_len);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+static int send_remove_xattr(struct send_ctx *sctx,
+			  struct fs_path *path,
+			  const char *name, int name_len)
+{
+	int ret = 0;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_REMOVE_XATTR);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
+	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	return ret;
+}
+
+static int __process_new_xattr(int num, const char *name, int name_len,
+			       const char *data, int data_len,
+			       u8 type, void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+	struct fs_path *p;
+	posix_acl_xattr_header dummy_acl;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	/*
+	 * This hack is needed because empty ACLs are stored as zero-byte
+	 * data in xattrs. The problem with that is that receiving these
+	 * zero-byte ACLs will fail later. To fix this, we send a dummy ACL
+	 * that only contains the version number and no entries.
+	 */
+	if (!strncmp(name, XATTR_NAME_POSIX_ACL_ACCESS, name_len) ||
+	    !strncmp(name, XATTR_NAME_POSIX_ACL_DEFAULT, name_len)) {
+		if (data_len == 0) {
+			dummy_acl.a_version =
+					cpu_to_le32(POSIX_ACL_XATTR_VERSION);
+			data = (char *)&dummy_acl;
+			data_len = sizeof(dummy_acl);
+		}
+	}
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	ret = send_set_xattr(sctx, p, name, name_len, data, data_len);
+
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int __process_deleted_xattr(int num, const char *name, int name_len,
+				   const char *data, int data_len,
+				   u8 type, void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+	struct fs_path *p;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	ret = send_remove_xattr(sctx, p, name, name_len);
+
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int process_new_xattr(struct send_ctx *sctx)
+{
+	int ret = 0;
+
+	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
+			sctx->cmp_key, __process_new_xattr, sctx);
+
+	return ret;
+}
+
+static int process_deleted_xattr(struct send_ctx *sctx)
+{
+	int ret;
+
+	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, __process_deleted_xattr, sctx);
+
+	return ret;
+}
+
+struct find_xattr_ctx {
+	const char *name;
+	int name_len;
+	int found_idx;
+	char *found_data;
+	int found_data_len;
+};
+
+static int __find_xattr(int num, const char *name, int name_len,
+			const char *data, int data_len,
+			u8 type, void *vctx)
+{
+	struct find_xattr_ctx *ctx = vctx;
+
+	if (name_len == ctx->name_len &&
+	    strncmp(name, ctx->name, name_len) == 0) {
+		ctx->found_idx = num;
+		ctx->found_data_len = data_len;
+		ctx->found_data = kmalloc(data_len, GFP_NOFS);
+		if (!ctx->found_data)
+			return -ENOMEM;
+		memcpy(ctx->found_data, data, data_len);
+		return 1;
+	}
+	return 0;
+}
+
+static int find_xattr(struct send_ctx *sctx,
+		      struct btrfs_root *root,
+		      struct btrfs_path *path,
+		      struct btrfs_key *key,
+		      const char *name, int name_len,
+		      char **data, int *data_len)
+{
+	int ret;
+	struct find_xattr_ctx ctx;
+
+	ctx.name = name;
+	ctx.name_len = name_len;
+	ctx.found_idx = -1;
+	ctx.found_data = NULL;
+	ctx.found_data_len = 0;
+
+	ret = iterate_dir_item(sctx, root, path, key, __find_xattr, &ctx);
+	if (ret < 0)
+		return ret;
+
+	if (ctx.found_idx == -1)
+		return -ENOENT;
+	if (data) {
+		*data = ctx.found_data;
+		*data_len = ctx.found_data_len;
+	} else {
+		kfree(ctx.found_data);
+	}
+	return ctx.found_idx;
+}
+
+
+static int __process_changed_new_xattr(int num, const char *name, int name_len,
+				       const char *data, int data_len,
+				       u8 type, void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+	char *found_data = NULL;
+	int found_data_len  = 0;
+	struct fs_path *p = NULL;
+
+	ret = find_xattr(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, name, name_len, &found_data,
+			&found_data_len);
+	if (ret == -ENOENT) {
+		ret = __process_new_xattr(num, name, name_len, data, data_len,
+				type, ctx);
+	} else if (ret >= 0) {
+		if (data_len != found_data_len ||
+		    memcmp(data, found_data, data_len)) {
+			ret = __process_new_xattr(num, name, name_len, data,
+					data_len, type, ctx);
+		} else {
+			ret = 0;
+		}
+	}
+
+	kfree(found_data);
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int __process_changed_deleted_xattr(int num, const char *name,
+					   int name_len,
+					   const char *data, int data_len,
+					   u8 type, void *ctx)
+{
+	int ret;
+	struct send_ctx *sctx = ctx;
+
+	ret = find_xattr(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
+			name, name_len, NULL, NULL);
+	if (ret == -ENOENT)
+		ret = __process_deleted_xattr(num, name, name_len, data,
+				data_len, type, ctx);
+	else if (ret >= 0)
+		ret = 0;
+
+	return ret;
+}
+
+static int process_changed_xattr(struct send_ctx *sctx)
+{
+	int ret = 0;
+
+	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
+			sctx->cmp_key, __process_changed_new_xattr, sctx);
+	if (ret < 0)
+		goto out;
+	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
+			sctx->cmp_key, __process_changed_deleted_xattr, sctx);
+
+out:
+	return ret;
+}
+
+static int process_all_new_xattrs(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_root *root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct extent_buffer *eb;
+	int slot;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	root = sctx->send_root;
+
+	key.objectid = sctx->cmp_key->objectid;
+	key.type = BTRFS_XATTR_ITEM_KEY;
+	key.offset = 0;
+	while (1) {
+		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &found_key, slot);
+
+		if (found_key.objectid != key.objectid ||
+		    found_key.type != key.type) {
+			ret = 0;
+			goto out;
+		}
+
+		ret = iterate_dir_item(sctx, root, path, &found_key,
+				__process_new_xattr, sctx);
+		if (ret < 0)
+			goto out;
+
+		btrfs_release_path(path);
+		key.offset = found_key.offset + 1;
+	}
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
+ * Read some bytes from the current inode/file and send a write command to
+ * user space.
+ */
+static int send_write(struct send_ctx *sctx, u64 offset, u32 len)
+{
+	int ret = 0;
+	struct fs_path *p;
+	loff_t pos = offset;
+	int num_read = 0;
+	mm_segment_t old_fs;
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	/*
+	 * vfs normally only accepts user space buffers for security reasons.
+	 * We only read from the file and only provide the read_buf buffer
+	 * to vfs. As this buffer does not come from a user space call, it's
+	 * ok to temporarily allow kernel space buffers.
+	 */
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+verbose_printk("btrfs: send_write offset=%llu, len=%d\n", offset, len);
+
+	ret = open_cur_inode_file(sctx);
+	if (ret < 0)
+		goto out;
+
+	ret = vfs_read(sctx->cur_inode_filp, sctx->read_buf, len, &pos);
+	if (ret < 0)
+		goto out;
+	num_read = ret;
+	if (!num_read)
+		goto out;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
+	TLV_PUT(sctx, BTRFS_SEND_A_DATA, sctx->read_buf, num_read);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	set_fs(old_fs);
+	if (ret < 0)
+		return ret;
+	return num_read;
+}
+
+/*
+ * Send a clone command to user space.
+ */
+static int send_clone(struct send_ctx *sctx,
+		      u64 offset, u32 len,
+		      struct clone_root *clone_root)
+{
+	int ret = 0;
+	struct btrfs_root *clone_root2 = clone_root->root;
+	struct fs_path *p;
+	u64 gen;
+
+verbose_printk("btrfs: send_clone offset=%llu, len=%d, clone_root=%llu, "
+	       "clone_inode=%llu, clone_offset=%llu\n", offset, len,
+		clone_root->root->objectid, clone_root->ino,
+		clone_root->offset);
+
+	p = fs_path_alloc(sctx);
+	if (!p)
+		return -ENOMEM;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_CLONE);
+	if (ret < 0)
+		goto out;
+
+	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_LEN, len);
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
+
+	if (clone_root2 == sctx->send_root) {
+		ret = get_inode_info(sctx->send_root, clone_root->ino, NULL,
+				&gen, NULL, NULL, NULL);
+		if (ret < 0)
+			goto out;
+		ret = get_cur_path(sctx, clone_root->ino, gen, p);
+	} else {
+		ret = get_inode_path(sctx, clone_root2, clone_root->ino, p);
+	}
+	if (ret < 0)
+		goto out;
+
+	TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
+			clone_root2->root_item.uuid);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
+			clone_root2->root_item.ctransid);
+	TLV_PUT_PATH(sctx, BTRFS_SEND_A_CLONE_PATH, p);
+	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_OFFSET,
+			clone_root->offset);
+
+	ret = send_cmd(sctx);
+
+tlv_put_failure:
+out:
+	fs_path_free(sctx, p);
+	return ret;
+}
+
+static int send_write_or_clone(struct send_ctx *sctx,
+			       struct btrfs_path *path,
+			       struct btrfs_key *key,
+			       struct clone_root *clone_root)
+{
+	int ret = 0;
+	struct btrfs_file_extent_item *ei;
+	u64 offset = key->offset;
+	u64 pos = 0;
+	u64 len;
+	u32 l;
+	u8 type;
+
+	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
+			struct btrfs_file_extent_item);
+	type = btrfs_file_extent_type(path->nodes[0], ei);
+	if (type == BTRFS_FILE_EXTENT_INLINE)
+		len = btrfs_file_extent_inline_len(path->nodes[0], ei);
+	else
+		len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
+
+	if (offset + len > sctx->cur_inode_size)
+		len = sctx->cur_inode_size - offset;
+	if (len == 0) {
+		ret = 0;
+		goto out;
+	}
+
+	if (!clone_root) {
+		while (pos < len) {
+			l = len - pos;
+			if (l > BTRFS_SEND_READ_SIZE)
+				l = BTRFS_SEND_READ_SIZE;
+			ret = send_write(sctx, pos + offset, l);
+			if (ret < 0)
+				goto out;
+			if (!ret)
+				break;
+			pos += ret;
+		}
+		ret = 0;
+	} else {
+		ret = send_clone(sctx, offset, len, clone_root);
+	}
+
+out:
+	return ret;
+}
+
+static int is_extent_unchanged(struct send_ctx *sctx,
+			       struct btrfs_path *left_path,
+			       struct btrfs_key *ekey)
+{
+	int ret = 0;
+	struct btrfs_key key;
+	struct btrfs_path *path = NULL;
+	struct extent_buffer *eb;
+	int slot;
+	struct btrfs_key found_key;
+	struct btrfs_file_extent_item *ei;
+	u64 left_disknr;
+	u64 right_disknr;
+	u64 left_offset;
+	u64 right_offset;
+	u64 left_len;
+	u64 right_len;
+	u8 left_type;
+	u8 right_type;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	eb = left_path->nodes[0];
+	slot = left_path->slots[0];
+
+	ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+	left_type = btrfs_file_extent_type(eb, ei);
+	left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
+	left_len = btrfs_file_extent_num_bytes(eb, ei);
+	left_offset = btrfs_file_extent_offset(eb, ei);
+
+	if (left_type != BTRFS_FILE_EXTENT_REG) {
+		ret = 0;
+		goto out;
+	}
+
+	key.objectid = ekey->objectid;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = ekey->offset;
+
+	while (1) {
+		ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path,
+				0, 0);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+				path->slots[0]);
+		if (found_key.objectid != key.objectid ||
+		    found_key.type != key.type) {
+			ret = 0;
+			goto out;
+		}
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+
+		ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+		right_type = btrfs_file_extent_type(eb, ei);
+		right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
+		right_len = btrfs_file_extent_num_bytes(eb, ei);
+		right_offset = btrfs_file_extent_offset(eb, ei);
+		btrfs_release_path(path);
+
+		if (right_type != BTRFS_FILE_EXTENT_REG) {
+			ret = 0;
+			goto out;
+		}
+
+		if (left_disknr != right_disknr) {
+			ret = 0;
+			goto out;
+		}
+
+		key.offset = found_key.offset + right_len;
+		if (key.offset >= ekey->offset + left_len) {
+			ret = 1;
+			goto out;
+		}
+	}
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int process_extent(struct send_ctx *sctx,
+			  struct btrfs_path *path,
+			  struct btrfs_key *key)
+{
+	int ret = 0;
+	struct clone_root *found_clone = NULL;
+
+	if (S_ISLNK(sctx->cur_inode_mode))
+		return 0;
+
+	if (sctx->parent_root && !sctx->cur_inode_new) {
+		ret = is_extent_unchanged(sctx, path, key);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+	}
+
+	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
+			sctx->cur_inode_size, &found_clone);
+	if (ret != -ENOENT && ret < 0)
+		goto out;
+
+	ret = send_write_or_clone(sctx, path, key, found_clone);
+
+out:
+	return ret;
+}
+
+static int process_all_extents(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_root *root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct extent_buffer *eb;
+	int slot;
+
+	root = sctx->send_root;
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = sctx->cmp_key->objectid;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = 0;
+	while (1) {
+		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &found_key, slot);
+
+		if (found_key.objectid != key.objectid ||
+		    found_key.type != key.type) {
+			ret = 0;
+			goto out;
+		}
+
+		ret = process_extent(sctx, path, &found_key);
+		if (ret < 0)
+			goto out;
+
+		btrfs_release_path(path);
+		key.offset = found_key.offset + 1;
+	}
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int process_recorded_refs_if_needed(struct send_ctx *sctx, int at_end)
+{
+	int ret = 0;
+
+	if (sctx->cur_ino == 0)
+		goto out;
+	if (!at_end && sctx->cur_ino == sctx->cmp_key->objectid &&
+	    sctx->cmp_key->type <= BTRFS_INODE_REF_KEY)
+		goto out;
+	if (list_empty(&sctx->new_refs) && list_empty(&sctx->deleted_refs))
+		goto out;
+
+	ret = process_recorded_refs(sctx);
+
+out:
+	return ret;
+}
+
+static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
+{
+	int ret = 0;
+	u64 left_mode;
+	u64 left_uid;
+	u64 left_gid;
+	u64 right_mode;
+	u64 right_uid;
+	u64 right_gid;
+	int need_chmod = 0;
+	int need_chown = 0;
+
+	ret = process_recorded_refs_if_needed(sctx, at_end);
+	if (ret < 0)
+		goto out;
+
+	if (sctx->cur_ino == 0 || sctx->cur_inode_deleted)
+		goto out;
+	if (!at_end && sctx->cmp_key->objectid == sctx->cur_ino)
+		goto out;
+
+	ret = get_inode_info(sctx->send_root, sctx->cur_ino, NULL, NULL,
+			&left_mode, &left_uid, &left_gid);
+	if (ret < 0)
+		goto out;
+
+	if (!S_ISLNK(sctx->cur_inode_mode)) {
+		if (!sctx->parent_root || sctx->cur_inode_new) {
+			need_chmod = 1;
+			need_chown = 1;
+		} else {
+			ret = get_inode_info(sctx->parent_root, sctx->cur_ino,
+					NULL, NULL, &right_mode, &right_uid,
+					&right_gid);
+			if (ret < 0)
+				goto out;
+
+			if (left_uid != right_uid || left_gid != right_gid)
+				need_chown = 1;
+			if (left_mode != right_mode)
+				need_chmod = 1;
+		}
+	}
+
+	if (S_ISREG(sctx->cur_inode_mode)) {
+		ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+				sctx->cur_inode_size);
+		if (ret < 0)
+			goto out;
+	}
+
+	if (need_chown) {
+		ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+				left_uid, left_gid);
+		if (ret < 0)
+			goto out;
+	}
+	if (need_chmod) {
+		ret = send_chmod(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+				left_mode);
+		if (ret < 0)
+			goto out;
+	}
+
+	/*
+	 * We need to send this every time, regardless of whether it actually
+	 * changed between the two trees, because we have modified the inode
+	 * above.
+	 */
+	ret = send_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen);
+	if (ret < 0)
+		goto out;
+
+out:
+	return ret;
+}
+
+static int changed_inode(struct send_ctx *sctx,
+			 enum btrfs_compare_tree_result result)
+{
+	int ret = 0;
+	struct btrfs_key *key = sctx->cmp_key;
+	struct btrfs_inode_item *left_ii = NULL;
+	struct btrfs_inode_item *right_ii = NULL;
+	u64 left_gen = 0;
+	u64 right_gen = 0;
+
+	ret = close_cur_inode_file(sctx);
+	if (ret < 0)
+		goto out;
+
+	sctx->cur_ino = key->objectid;
+	sctx->cur_inode_new_gen = 0;
+	sctx->send_progress = sctx->cur_ino;
+
+	if (result == BTRFS_COMPARE_TREE_NEW ||
+	    result == BTRFS_COMPARE_TREE_CHANGED) {
+		left_ii = btrfs_item_ptr(sctx->left_path->nodes[0],
+				sctx->left_path->slots[0],
+				struct btrfs_inode_item);
+		left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
+				left_ii);
+	} else {
+		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
+				sctx->right_path->slots[0],
+				struct btrfs_inode_item);
+		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
+				right_ii);
+	}
+	if (result == BTRFS_COMPARE_TREE_CHANGED) {
+		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
+				sctx->right_path->slots[0],
+				struct btrfs_inode_item);
+
+		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
+				right_ii);
+		if (left_gen != right_gen)
+			sctx->cur_inode_new_gen = 1;
+	}
+
+	if (result == BTRFS_COMPARE_TREE_NEW) {
+		sctx->cur_inode_gen = left_gen;
+		sctx->cur_inode_new = 1;
+		sctx->cur_inode_deleted = 0;
+		sctx->cur_inode_size = btrfs_inode_size(
+				sctx->left_path->nodes[0], left_ii);
+		sctx->cur_inode_mode = btrfs_inode_mode(
+				sctx->left_path->nodes[0], left_ii);
+		if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
+			ret = send_create_inode(sctx, sctx->left_path,
+					sctx->cmp_key);
+	} else if (result == BTRFS_COMPARE_TREE_DELETED) {
+		sctx->cur_inode_gen = right_gen;
+		sctx->cur_inode_new = 0;
+		sctx->cur_inode_deleted = 1;
+		sctx->cur_inode_size = btrfs_inode_size(
+				sctx->right_path->nodes[0], right_ii);
+		sctx->cur_inode_mode = btrfs_inode_mode(
+				sctx->right_path->nodes[0], right_ii);
+	} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
+		if (sctx->cur_inode_new_gen) {
+			sctx->cur_inode_gen = right_gen;
+			sctx->cur_inode_new = 0;
+			sctx->cur_inode_deleted = 1;
+			sctx->cur_inode_size = btrfs_inode_size(
+					sctx->right_path->nodes[0], right_ii);
+			sctx->cur_inode_mode = btrfs_inode_mode(
+					sctx->right_path->nodes[0], right_ii);
+			ret = process_all_refs(sctx,
+					BTRFS_COMPARE_TREE_DELETED);
+			if (ret < 0)
+				goto out;
+
+			sctx->cur_inode_gen = left_gen;
+			sctx->cur_inode_new = 1;
+			sctx->cur_inode_deleted = 0;
+			sctx->cur_inode_size = btrfs_inode_size(
+					sctx->left_path->nodes[0], left_ii);
+			sctx->cur_inode_mode = btrfs_inode_mode(
+					sctx->left_path->nodes[0], left_ii);
+			ret = send_create_inode(sctx, sctx->left_path,
+					sctx->cmp_key);
+			if (ret < 0)
+				goto out;
+
+			ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW);
+			if (ret < 0)
+				goto out;
+			ret = process_all_extents(sctx);
+			if (ret < 0)
+				goto out;
+			ret = process_all_new_xattrs(sctx);
+			if (ret < 0)
+				goto out;
+		} else {
+			sctx->cur_inode_gen = left_gen;
+			sctx->cur_inode_new = 0;
+			sctx->cur_inode_new_gen = 0;
+			sctx->cur_inode_deleted = 0;
+			sctx->cur_inode_size = btrfs_inode_size(
+					sctx->left_path->nodes[0], left_ii);
+			sctx->cur_inode_mode = btrfs_inode_mode(
+					sctx->left_path->nodes[0], left_ii);
+		}
+	}
+
+out:
+	return ret;
+}
+
+static int changed_ref(struct send_ctx *sctx,
+		       enum btrfs_compare_tree_result result)
+{
+	int ret = 0;
+
+	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+
+	if (!sctx->cur_inode_new_gen &&
+	    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) {
+		if (result == BTRFS_COMPARE_TREE_NEW)
+			ret = record_new_ref(sctx);
+		else if (result == BTRFS_COMPARE_TREE_DELETED)
+			ret = record_deleted_ref(sctx);
+		else if (result == BTRFS_COMPARE_TREE_CHANGED)
+			ret = record_changed_ref(sctx);
+	}
+
+	return ret;
+}
+
+static int changed_xattr(struct send_ctx *sctx,
+			 enum btrfs_compare_tree_result result)
+{
+	int ret = 0;
+
+	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+
+	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
+		if (result == BTRFS_COMPARE_TREE_NEW)
+			ret = process_new_xattr(sctx);
+		else if (result == BTRFS_COMPARE_TREE_DELETED)
+			ret = process_deleted_xattr(sctx);
+		else if (result == BTRFS_COMPARE_TREE_CHANGED)
+			ret = process_changed_xattr(sctx);
+	}
+
+	return ret;
+}
+
+static int changed_extent(struct send_ctx *sctx,
+			  enum btrfs_compare_tree_result result)
+{
+	int ret = 0;
+
+	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
+
+	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
+		if (result != BTRFS_COMPARE_TREE_DELETED)
+			ret = process_extent(sctx, sctx->left_path,
+					sctx->cmp_key);
+	}
+
+	return ret;
+}
+
+
+static int changed_cb(struct btrfs_root *left_root,
+		      struct btrfs_root *right_root,
+		      struct btrfs_path *left_path,
+		      struct btrfs_path *right_path,
+		      struct btrfs_key *key,
+		      enum btrfs_compare_tree_result result,
+		      void *ctx)
+{
+	int ret = 0;
+	struct send_ctx *sctx = ctx;
+
+	sctx->left_path = left_path;
+	sctx->right_path = right_path;
+	sctx->cmp_key = key;
+
+	ret = finish_inode_if_needed(sctx, 0);
+	if (ret < 0)
+		goto out;
+
+	if (key->type == BTRFS_INODE_ITEM_KEY)
+		ret = changed_inode(sctx, result);
+	else if (key->type == BTRFS_INODE_REF_KEY)
+		ret = changed_ref(sctx, result);
+	else if (key->type == BTRFS_XATTR_ITEM_KEY)
+		ret = changed_xattr(sctx, result);
+	else if (key->type == BTRFS_EXTENT_DATA_KEY)
+		ret = changed_extent(sctx, result);
+
+out:
+	return ret;
+}
+
+static int full_send_tree(struct send_ctx *sctx)
+{
+	int ret;
+	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_root *send_root = sctx->send_root;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_path *path;
+	struct extent_buffer *eb;
+	int slot;
+	u64 start_ctransid;
+	u64 ctransid;
+
+	path = alloc_path_for_send();
+	if (!path)
+		return -ENOMEM;
+
+	spin_lock(&send_root->root_times_lock);
+	start_ctransid = btrfs_root_ctransid(&send_root->root_item);
+	spin_unlock(&send_root->root_times_lock);
+
+	key.objectid = BTRFS_FIRST_FREE_OBJECTID;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+
+join_trans:
+	/*
+	 * We need to make sure the transaction does not get committed
+	 * while we do anything on commit roots. Join a transaction to prevent
+	 * this.
+	 */
+	trans = btrfs_join_transaction(send_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		trans = NULL;
+		goto out;
+	}
+
+	/*
+	 * Make sure the tree has not changed
+	 */
+	spin_lock(&send_root->root_times_lock);
+	ctransid = btrfs_root_ctransid(&send_root->root_item);
+	spin_unlock(&send_root->root_times_lock);
+
+	if (ctransid != start_ctransid) {
+		WARN(1, KERN_WARNING "btrfs: the root that you're trying to "
+				     "send was modified in between. This is "
+				     "probably a bug.\n");
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0);
+	if (ret < 0)
+		goto out;
+	if (ret)
+		goto out_finish;
+
+	while (1) {
+		/*
+		 * When someone wants to commit while we iterate, end the
+		 * joined transaction and rejoin.
+		 */
+		if (btrfs_should_end_transaction(trans, send_root)) {
+			ret = btrfs_end_transaction(trans, send_root);
+			trans = NULL;
+			if (ret < 0)
+				goto out;
+			btrfs_release_path(path);
+			goto join_trans;
+		}
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &found_key, slot);
+
+		ret = changed_cb(send_root, NULL, path, NULL,
+				&found_key, BTRFS_COMPARE_TREE_NEW, sctx);
+		if (ret < 0)
+			goto out;
+
+		key.objectid = found_key.objectid;
+		key.type = found_key.type;
+		key.offset = found_key.offset + 1;
+
+		ret = btrfs_next_item(send_root, path);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret  = 0;
+			break;
+		}
+	}
+
+out_finish:
+	ret = finish_inode_if_needed(sctx, 1);
+
+out:
+	btrfs_free_path(path);
+	if (trans) {
+		if (!ret)
+			ret = btrfs_end_transaction(trans, send_root);
+		else
+			btrfs_end_transaction(trans, send_root);
+	}
+	return ret;
+}
+
+static int send_subvol(struct send_ctx *sctx)
+{
+	int ret;
+
+	ret = send_header(sctx);
+	if (ret < 0)
+		goto out;
+
+	ret = send_subvol_begin(sctx);
+	if (ret < 0)
+		goto out;
+
+	if (sctx->parent_root) {
+		ret = btrfs_compare_trees(sctx->send_root, sctx->parent_root,
+				changed_cb, sctx);
+		if (ret < 0)
+			goto out;
+		ret = finish_inode_if_needed(sctx, 1);
+		if (ret < 0)
+			goto out;
+	} else {
+		ret = full_send_tree(sctx);
+		if (ret < 0)
+			goto out;
+	}
+
+out:
+	if (!ret)
+		ret = close_cur_inode_file(sctx);
+	else
+		close_cur_inode_file(sctx);
+
+	free_recorded_refs(sctx);
+	return ret;
+}
+
+long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
+{
+	int ret = 0;
+	struct btrfs_root *send_root;
+	struct btrfs_root *clone_root;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_ioctl_send_args *arg = NULL;
+	struct btrfs_key key;
+	struct file *filp = NULL;
+	struct send_ctx *sctx = NULL;
+	u32 i;
+	u64 *clone_sources_tmp = NULL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	send_root = BTRFS_I(fdentry(mnt_file)->d_inode)->root;
+	fs_info = send_root->fs_info;
+
+	arg = memdup_user(arg_, sizeof(*arg));
+	if (IS_ERR(arg)) {
+		ret = PTR_ERR(arg);
+		arg = NULL;
+		goto out;
+	}
+
+	if (!access_ok(VERIFY_READ, arg->clone_sources,
+			sizeof(*arg->clone_sources) *
+			arg->clone_sources_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	sctx = kzalloc(sizeof(struct send_ctx), GFP_NOFS);
+	if (!sctx) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&sctx->new_refs);
+	INIT_LIST_HEAD(&sctx->deleted_refs);
+	INIT_RADIX_TREE(&sctx->name_cache, GFP_NOFS);
+	INIT_LIST_HEAD(&sctx->name_cache_list);
+
+	sctx->send_filp = fget(arg->send_fd);
+	if (IS_ERR(sctx->send_filp)) {
+		ret = PTR_ERR(sctx->send_filp);
+		goto out;
+	}
+
+	sctx->mnt = mnt_file->f_path.mnt;
+
+	sctx->send_root = send_root;
+	sctx->clone_roots_cnt = arg->clone_sources_count;
+
+	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
+	sctx->send_buf = vmalloc(sctx->send_max_size);
+	if (!sctx->send_buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
+	if (!sctx->read_buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	sctx->clone_roots = vzalloc(sizeof(struct clone_root) *
+			(arg->clone_sources_count + 1));
+	if (!sctx->clone_roots) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (arg->clone_sources_count) {
+		clone_sources_tmp = vmalloc(arg->clone_sources_count *
+				sizeof(*arg->clone_sources));
+		if (!clone_sources_tmp) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
+				arg->clone_sources_count *
+				sizeof(*arg->clone_sources));
+		if (ret) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		for (i = 0; i < arg->clone_sources_count; i++) {
+			key.objectid = clone_sources_tmp[i];
+			key.type = BTRFS_ROOT_ITEM_KEY;
+			key.offset = (u64)-1;
+			clone_root = btrfs_read_fs_root_no_name(fs_info, &key);
+			if (!clone_root) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (IS_ERR(clone_root)) {
+				ret = PTR_ERR(clone_root);
+				goto out;
+			}
+			sctx->clone_roots[i].root = clone_root;
+		}
+		vfree(clone_sources_tmp);
+		clone_sources_tmp = NULL;
+	}
+
+	if (arg->parent_root) {
+		key.objectid = arg->parent_root;
+		key.type = BTRFS_ROOT_ITEM_KEY;
+		key.offset = (u64)-1;
+		sctx->parent_root = btrfs_read_fs_root_no_name(fs_info, &key);
+		if (!sctx->parent_root) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	/*
+	 * Clones from send_root are allowed, but only if the clone source
+	 * is behind the current send position. This is checked while searching
+	 * for possible clone sources.
+	 */
+	sctx->clone_roots[sctx->clone_roots_cnt++].root = sctx->send_root;
+
+	/* We do a bsearch later */
+	sort(sctx->clone_roots, sctx->clone_roots_cnt,
+			sizeof(*sctx->clone_roots), __clone_root_cmp_sort,
+			NULL);
+
+	ret = send_subvol(sctx);
+	if (ret < 0)
+		goto out;
+
+	ret = begin_cmd(sctx, BTRFS_SEND_C_END);
+	if (ret < 0)
+		goto out;
+	ret = send_cmd(sctx);
+	if (ret < 0)
+		goto out;
+
+out:
+	if (filp)
+		fput(filp);
+	kfree(arg);
+	vfree(clone_sources_tmp);
+
+	if (sctx) {
+		if (sctx->send_filp)
+			fput(sctx->send_filp);
+
+		vfree(sctx->clone_roots);
+		vfree(sctx->send_buf);
+		vfree(sctx->read_buf);
+
+		name_cache_free(sctx);
+
+		kfree(sctx);
+	}
+
+	return ret;
+}
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
index a4c23ee..53f8ee7 100644
--- a/fs/btrfs/send.h
+++ b/fs/btrfs/send.h
@@ -124,3 +124,7 @@ enum {
 	__BTRFS_SEND_A_MAX,
 };
 #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
+
+#ifdef __KERNEL__
+long btrfs_ioctl_send(struct file *mnt_file, void __user *arg);
+#endif
-- 
1.7.10


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 13:38 ` [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function Alexander Block
@ 2012-07-04 18:27   ` Alex Lyakas
  2012-07-04 19:49     ` Alexander Block
  2012-07-04 19:13   ` Alex Lyakas
  1 sibling, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-04 18:27 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Hi Alex,

> +                       spin_lock(&left_root->root_times_lock);
> +                       ctransid = btrfs_root_ctransid(&left_root->root_item);
> +                       spin_unlock(&left_root->root_times_lock);
> +                       if (ctransid != left_start_ctransid)
> +                               left_start_ctransid = 0;
> +
> +                       spin_lock(&right_root->root_times_lock);
> +                       ctransid = btrfs_root_ctransid(&right_root->root_item);
> +                       spin_unlock(&right_root->root_times_lock);
> +                       if (ctransid != right_start_ctransid)
> +                               left_start_ctransid = 0;
Shouldn't that be right_start_ctransid = 0 here? Otherwise,
right_start_ctransid is pretty useless in this function.

> +
> +                       if (!left_start_ctransid || !right_start_ctransid) {
> +                               WARN(1, KERN_WARNING
> +                                       "btrfs: btrfs_compare_tree detected "
> +                                       "a change in one of the trees while "
> +                                       "iterating. This is probably a "
> +                                       "bug.\n");
> +                               ret = -EIO;
> +                               goto out;
> +                       }

I am reading the code and have more questions (and comments), but I
will send them all later.

Alex.


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 13:38 ` [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function Alexander Block
  2012-07-04 18:27   ` Alex Lyakas
@ 2012-07-04 19:13   ` Alex Lyakas
  2012-07-04 20:18     ` Alexander Block
  1 sibling, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-04 19:13 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Hi Alex,

> +static int tree_compare_item(struct btrfs_root *left_root,
> +                            struct btrfs_path *left_path,
> +                            struct btrfs_path *right_path,
> +                            char *tmp_buf)
> +{
> +       int cmp;
> +       int len1, len2;
> +       unsigned long off1, off2;
> +
> +       len1 = btrfs_item_size_nr(left_path->nodes[0], left_path->slots[0]);
> +       len2 = btrfs_item_size_nr(right_path->nodes[0], right_path->slots[0]);
> +       if (len1 != len2)
> +               return 1;
> +
> +       off1 = btrfs_item_ptr_offset(left_path->nodes[0], left_path->slots[0]);
> +       off2 = btrfs_item_ptr_offset(right_path->nodes[0],
> +                               right_path->slots[0]);
> +
> +       read_extent_buffer(left_path->nodes[0], tmp_buf, off1, len1);
> +
> +       cmp = memcmp_extent_buffer(right_path->nodes[0], tmp_buf, off2, len1);
> +       if (cmp)
> +               return 1;
> +       return 0;
> +}
It might be worth noting in the comment that tmp_buf should be
large enough to hold the item from the left tree. Can it happen that
the right tree has a different leafsize?

> +       /*
> +        * Strategy: Go to the first items of both trees. Then do
> +        *
> +        * If both trees are at level 0
> +        *   Compare keys of current items
> +        *     If left < right treat left item as new, advance left tree
> +        *       and repeat
> +        *     If left > right treat right item as deleted, advance right tree
> +        *       and repeat
> +        *     If left == right do deep compare of items, treat as changed if
> +        *       needed, advance both trees and repeat
> +        * If both trees are at the same level but not at level 0
> +        *   Compare keys of current nodes/leafs
> +        *     If left < right advance left tree and repeat
> +        *     If left > right advance right tree and repeat
> +        *     If left == right compare blockptrs of the next nodes/leafs
> +        *       If they match advance both trees but stay at the same level
> +        *         and repeat
> +        *       If they don't match advance both trees while allowing to go
> +        *         deeper and repeat
> +        * If tree levels are different
> +        *   Advance the tree that needs it and repeat
> +        *
> +        * Advancing a tree means:
> +        *   If we are at level 0, try to go to the next slot. If that's not
> +        *   possible, go one level up and repeat. Stop when we found a level
> +        *   where we could go to the next slot. We may at this point be on a
> +        *   node or a leaf.
> +        *
> +        *   If we are not at level 0 and not on shared tree blocks, go one
> +        *   level deeper.
> +        *
> +        *   If we are not at level 0 and on shared tree blocks, go one slot to
> +        *   the right if possible or go up and right.
> +        */
According to the strategy and to the code later, the "left" tree is
treated as the newer one, while the "right" tree is the older one,
correct? Do you think it would be more intuitive to make it the other
way around? Although I guess this is a matter of personal taste. I had
to draw the leaves reversed to keep going:
R           L
-------     -------
| | | |     | | | |
-------     -------
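In userspace terms, the level-0 part of the quoted strategy is an ordinary merge over two key-sorted sequences. A minimal sketch (hypothetical types and names, not the kernel code):

```c
/* Sketch (not the kernel code) of the level-0 part of the strategy:
 * both scan heads sit on key-sorted sequences, and the lower key
 * tells us which side has an item the other side lacks. */
#include <assert.h>
#include <string.h>

enum change { CH_NEW, CH_DELETED, CH_CHANGED, CH_SAME };

struct item { unsigned long long key; const char *data; };

/* walk two key-sorted item arrays, recording one status per distinct
 * key; returns the number of statuses written to out[]/out_key[] */
static int compare_level0(const struct item *left, int nleft,
			  const struct item *right, int nright,
			  enum change *out, unsigned long long *out_key)
{
	int l = 0, r = 0, n = 0;

	while (l < nleft || r < nright) {
		if (r >= nright || (l < nleft && left[l].key < right[r].key)) {
			/* key only in the newer (left) tree: new item */
			out_key[n] = left[l++].key;
			out[n++] = CH_NEW;
		} else if (l >= nleft || left[l].key > right[r].key) {
			/* key only in the older (right) tree: deleted */
			out_key[n] = right[r++].key;
			out[n++] = CH_DELETED;
		} else {
			/* same key: deep compare of the two items */
			out_key[n] = left[l].key;
			out[n++] = strcmp(left[l].data, right[r].data) ?
				   CH_CHANGED : CH_SAME;
			l++;
			r++;
		}
	}
	return n;
}
```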


> +               if (advance_left && !left_end_reached) {
> +                       ret = tree_advance(left_root, left_path, &left_level,
> +                                       left_root_level,
> +                                       advance_left != ADVANCE_ONLY_NEXT,
> +                                       &left_key);
> +                       if (ret < 0)
> +                               left_end_reached = ADVANCE;
> +                       advance_left = 0;
> +               }
> +               if (advance_right && !right_end_reached) {
> +                       ret = tree_advance(right_root, right_path, &right_level,
> +                                       right_root_level,
> +                                       advance_right != ADVANCE_ONLY_NEXT,
> +                                       &right_key);
> +                       if (ret < 0)
> +                               right_end_reached = ADVANCE;
> +                       advance_right = 0;
> +               }
Do you think it's worth putting a check/warning/something before
that, asserting that either advance_right or advance_left is non-zero,
or that we have reached the ends of both trees?


> +               } else if (left_level == right_level) {
...
> +               } else if (left_level < right_level) {
> +                       advance_right = ADVANCE;
> +               } else {
> +                       advance_left = ADVANCE;
> +               }
Can you please explain why this is correct?
Why, if we are on a lower level in the "newer" tree than in the
"older" tree, do we need to advance the "older" tree? I.e., why does
this imply that we are on the lower key in the "older" tree (and
vice-versa)? I.e., how does a difference in levels indicate a relation
between keys?

Thanks,
  Alex.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 18:27   ` Alex Lyakas
@ 2012-07-04 19:49     ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-04 19:49 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Wed, Jul 4, 2012 at 8:27 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Hi Alex,
>
>> +                       spin_lock(&left_root->root_times_lock);
>> +                       ctransid = btrfs_root_ctransid(&left_root->root_item);
>> +                       spin_unlock(&left_root->root_times_lock);
>> +                       if (ctransid != left_start_ctransid)
>> +                               left_start_ctransid = 0;
>> +
>> +                       spin_lock(&right_root->root_times_lock);
>> +                       ctransid = btrfs_root_ctransid(&right_root->root_item);
>> +                       spin_unlock(&right_root->root_times_lock);
>> +                       if (ctransid != right_start_ctransid)
>> +                               left_start_ctransid = 0;
> Shouldn't it be right_start_ctransid = 0 here? Otherwise,
> right_start_ctransid is pretty useless in this function.
>
Hmm, you're right, it should be right_start_ctransid. However... the
code was working by accident, because the next if checks both left
and right :)
Fixed that in my git repo.
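The intent of the check, with the fix applied, can be sketched in isolation (hypothetical types, not the kernel code): each side remembers the ctransid it saw at the start and compares against its own start value.

```c
/* Sketch of the fixed check (hypothetical types, not the kernel
 * code): remember each root's commit transid when the compare starts
 * and abort if either root was committed underneath us. */
#include <assert.h>

struct root { unsigned long long ctransid; };

/* returns 0 if both roots are unchanged since the start, -1 if the
 * compare has to be aborted */
static int check_roots_unchanged(const struct root *left,
				 unsigned long long left_start,
				 const struct root *right,
				 unsigned long long right_start)
{
	/* each side must be compared against ITS OWN start value --
	 * the bug above cleared left_start instead of right_start */
	if (left->ctransid != left_start)
		return -1;
	if (right->ctransid != right_start)
		return -1;
	return 0;
}
```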
>> +
>> +                       if (!left_start_ctransid || !right_start_ctransid) {
>> +                               WARN(1, KERN_WARNING
>> +                                       "btrfs: btrfs_compare_tree detected "
>> +                                       "a change in one of the trees while "
>> +                                       "iterating. This is probably a "
>> +                                       "bug.\n");
>> +                               ret = -EIO;
>> +                               goto out;
>> +                       }
>
> I am reading the code and have more questions (and comments), but
> will send them all later.
>
> Alex.


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 19:13   ` Alex Lyakas
@ 2012-07-04 20:18     ` Alexander Block
  2012-07-04 23:31       ` David Sterba
  2012-07-05 12:19       ` Alex Lyakas
  0 siblings, 2 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-04 20:18 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Wed, Jul 4, 2012 at 9:13 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Hi Alex,
>
>> +static int tree_compare_item(struct btrfs_root *left_root,
>> +                            struct btrfs_path *left_path,
>> +                            struct btrfs_path *right_path,
>> +                            char *tmp_buf)
>> +{
>> +       int cmp;
>> +       int len1, len2;
>> +       unsigned long off1, off2;
>> +
>> +       len1 = btrfs_item_size_nr(left_path->nodes[0], left_path->slots[0]);
>> +       len2 = btrfs_item_size_nr(right_path->nodes[0], right_path->slots[0]);
>> +       if (len1 != len2)
>> +               return 1;
>> +
>> +       off1 = btrfs_item_ptr_offset(left_path->nodes[0], left_path->slots[0]);
>> +       off2 = btrfs_item_ptr_offset(right_path->nodes[0],
>> +                               right_path->slots[0]);
>> +
>> +       read_extent_buffer(left_path->nodes[0], tmp_buf, off1, len1);
>> +
>> +       cmp = memcmp_extent_buffer(right_path->nodes[0], tmp_buf, off2, len1);
>> +       if (cmp)
>> +               return 1;
>> +       return 0;
>> +}
> It might be worth noting in the comment that tmp_buf should be
> large enough to hold the item from the left tree. Can it happen that
> the right tree has a different leafsize?
>
This function is only to be used by the tree compare function, and
there we allocate a buffer of root->leafsize, so all items are
guaranteed to fit. As far as I know, Chris (please correct me if I'm
wrong) once guaranteed that ALL trees in a FS have the same leaf size
and that this will always be the case.
>> +       /*
>> +        * Strategy: Go to the first items of both trees. Then do
>> +        *
>> +        * If both trees are at level 0
>> +        *   Compare keys of current items
>> +        *     If left < right treat left item as new, advance left tree
>> +        *       and repeat
>> +        *     If left > right treat right item as deleted, advance right tree
>> +        *       and repeat
>> +        *     If left == right do deep compare of items, treat as changed if
>> +        *       needed, advance both trees and repeat
>> +        * If both trees are at the same level but not at level 0
>> +        *   Compare keys of current nodes/leafs
>> +        *     If left < right advance left tree and repeat
>> +        *     If left > right advance right tree and repeat
>> +        *     If left == right compare blockptrs of the next nodes/leafs
>> +        *       If they match advance both trees but stay at the same level
>> +        *         and repeat
>> +        *       If they don't match advance both trees while allowing to go
>> +        *         deeper and repeat
>> +        * If tree levels are different
>> +        *   Advance the tree that needs it and repeat
>> +        *
>> +        * Advancing a tree means:
>> +        *   If we are at level 0, try to go to the next slot. If that's not
>> +        *   possible, go one level up and repeat. Stop when we found a level
>> +        *   where we could go to the next slot. We may at this point be on a
>> +        *   node or a leaf.
>> +        *
>> +        *   If we are not at level 0 and not on shared tree blocks, go one
>> +        *   level deeper.
>> +        *
>> +        *   If we are not at level 0 and on shared tree blocks, go one slot to
>> +        *   the right if possible or go up and right.
>> +        */
> According to the strategy and to the code later, "left" tree is
> treated as "newer one", while "right" as "older one", correct? Do you
> think it would be more intuitive to make it the other way around,
> although I guess this is a matter of personal taste. I had to draw the
> leafs reversed to keep going:
> R           L
> -------     -------
> | | | |     | | | |
> -------     -------
>
>
To be honest... I always preferred the way you suggest whenever I
thought about compares in the past. But for some reason, I didn't
even think about that here and just implemented the function in a
single flow... it took days until I even noticed that I had swapped
left/right in my head :D I would now like to stay with this, as all
the btrfs send code uses left/right in this way and I never had the
problem of mixing them up again. I have nothing against someone
changing that later, but it's nothing I would like to do myself.
>> +               if (advance_left && !left_end_reached) {
>> +                       ret = tree_advance(left_root, left_path, &left_level,
>> +                                       left_root_level,
>> +                                       advance_left != ADVANCE_ONLY_NEXT,
>> +                                       &left_key);
>> +                       if (ret < 0)
>> +                               left_end_reached = ADVANCE;
>> +                       advance_left = 0;
>> +               }
>> +               if (advance_right && !right_end_reached) {
>> +                       ret = tree_advance(right_root, right_path, &right_level,
>> +                                       right_root_level,
>> +                                       advance_right != ADVANCE_ONLY_NEXT,
>> +                                       &right_key);
>> +                       if (ret < 0)
>> +                               right_end_reached = ADVANCE;
>> +                       advance_right = 0;
>> +               }
> Do you think it's worth it to put a check/warning/smth before that,
> that either advance_right or advance_left is non-zero, or we have
> reached ends in both trees?
>
>
Having the left or right end reached before the other side's end is
completely normal and expected.
>> +               } else if (left_level == right_level) {
> ...
>> +               } else if (left_level < right_level) {
>> +                       advance_right = ADVANCE;
>> +               } else {
>> +                       advance_left = ADVANCE;
>> +               }
> Can you pls explain why it is correct?
> Why if we are on lower level in the "newer" tree than we are in the
> "older" tree, we need to advance the "older" tree? I.e., why this
> implies that we are on the lower key in the "older" tree? (And
> vice-versa). I.e., how difference in levels indicates relation between
> keys?
Difference in levels has no relation to the keys. These advances
basically try to keep the two trees' positions "in sync". The compare
always tries to get both trees to a point where they are at the same
level, as only then can we compare keys. Also, the two trees may have
different root levels; this code handles that case too.
>
> Thanks,
>   Alex.
Thanks for the review :)


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 20:18     ` Alexander Block
@ 2012-07-04 23:31       ` David Sterba
  2012-07-05 12:19       ` Alex Lyakas
  1 sibling, 0 replies; 43+ messages in thread
From: David Sterba @ 2012-07-04 23:31 UTC (permalink / raw)
  To: Alexander Block; +Cc: Alex Lyakas, linux-btrfs

On Wed, Jul 04, 2012 at 10:18:34PM +0200, Alexander Block wrote:
> > It might be worth to note in the comment, that tmp_buff should be
> > large enough to hold the item from the left tree. Can it happen that
> > the right tree has a different leafsize?
> >
> This function is only to be used by the tree compare function, and
> there we allocate a buffer of root->leafsize, so all items are
> guaranteed to fit. As far as I know, Chris (please correct me if I'm
> wrong) once guaranteed that ALL trees in a FS have the same leaf size
> and that this will always be the case.

Not only leaves are of the same size in all trees, but also nodes, since
the metadata bigblocks patches.

david


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-04 13:38 ` [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times Alexander Block
@ 2012-07-05 11:51   ` Alexander Block
  2012-07-05 17:08   ` Zach Brown
  2012-07-16 14:56   ` Arne Jansen
  2 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-05 11:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Alexander Block

On Wed, Jul 4, 2012 at 3:38 PM, Alexander Block <ablock84@googlemail.com> wrote:
> This patch introduces uuids for subvolumes.
> [...]
Stefan and Jan pointed out a problem with this patch that would result
in read_extent_buffer calls that read beyond the leaf size when an old
root item is found at the end of a leaf. I pushed a fix to github that
tries to replace read_extent_buffer calls with a new helper function
(btrfs_read_root_item). The patch also adds a __le64 reserved[8] field
to the root_item struct, which means that when you apply that fix, all
your extra information in the root items will get lost on the next
mount.

As already said in the cover letter, I will not send patches to the
list until version 2 is worth sending. Such fixes are only found in my
git repo and will later be squashed together for v2.


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-04 20:18     ` Alexander Block
  2012-07-04 23:31       ` David Sterba
@ 2012-07-05 12:19       ` Alex Lyakas
  2012-07-05 12:47         ` Alexander Block
  1 sibling, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-05 12:19 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Alexander,

>>> +               if (advance_left && !left_end_reached) {
>>> +                       ret = tree_advance(left_root, left_path, &left_level,
>>> +                                       left_root_level,
>>> +                                       advance_left != ADVANCE_ONLY_NEXT,
>>> +                                       &left_key);
>>> +                       if (ret < 0)
>>> +                               left_end_reached = ADVANCE;
>>> +                       advance_left = 0;
>>> +               }
>>> +               if (advance_right && !right_end_reached) {
>>> +                       ret = tree_advance(right_root, right_path, &right_level,
>>> +                                       right_root_level,
>>> +                                       advance_right != ADVANCE_ONLY_NEXT,
>>> +                                       &right_key);
>>> +                       if (ret < 0)
>>> +                               right_end_reached = ADVANCE;
>>> +                       advance_right = 0;
>>> +               }
>> Do you think it's worth it to put a check/warning/smth before that,
>> that either advance_right or advance_left is non-zero, or we have
>> reached ends in both trees?
>>
>>
> Having the left or right end reached before the other side's end is
> completely normal and expected.
What I meant was that when we start the loop, either advance_left!=0
or advance_right!=0. So I thought it would be good to notice that.
However, on the very first loop iteration, both of them are zero, so I
was wrong.


>>> +               } else if (left_level == right_level) {
>> ...
>>> +               } else if (left_level < right_level) {
>>> +                       advance_right = ADVANCE;
>>> +               } else {
>>> +                       advance_left = ADVANCE;
>>> +               }
>> Can you pls explain why it is correct?
>> Why if we are on lower level in the "newer" tree than we are in the
>> "older" tree, we need to advance the "older" tree? I.e., why this
>> implies that we are on the lower key in the "older" tree? (And
>> vice-versa). I.e., how difference in levels indicates relation between
>> keys?
> Difference in levels has no relation to the keys. These advances
> basically try to keep the two trees positions "in-sync". The compare
> always tries to get both trees to a point where they are at the same
> level, as only then we can compare keys. Also, the two trees may have
> different root levels, this code also handles that case.

Can you please tell me if I understand your algorithm correctly:
Basically, we need to get to the leaf levels and compare the items in
the leafs. Only when we are on the leaf level can we safely signal
deletions and additions of items, not on upper levels.
There is only one optimization: we want to find nodes that are shared,
and such nodes can only be on the same level. To make this
optimization happen, we try to always match the levels of the two
trees. This is the purpose of:
		} else if (left_level < right_level) {
			advance_right = ADVANCE;
		} else {
			advance_left = ADVANCE;
		}
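The shared-node shortcut this enables can be sketched as follows (hypothetical types, not the kernel code): at the same non-zero level, an aligned slot with equal keys and equal block pointers references a subtree that is shared between the two snapshots, so the compare never needs to descend into it.

```c
/* Sketch of the shared-subtree shortcut (hypothetical types, not the
 * kernel code): at the same non-zero level, an aligned slot whose key
 * and block pointer match on both sides points to a subtree that is
 * shared between the snapshots, so it needs no descent at all. */
#include <assert.h>

struct node_ptr { unsigned long long key, blockptr; };

/* count the aligned slots whose whole subtree can be skipped */
static int count_shared(const struct node_ptr *l,
			const struct node_ptr *r, int n)
{
	int i, shared = 0;

	for (i = 0; i < n; i++)
		if (l[i].key == r[i].key && l[i].blockptr == r[i].blockptr)
			shared++;	/* unchanged subtree: skip it */
	return shared;
}
```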

Note: I think that instead of comparing levels, we could always
compare keys and ADVANCE the lower key. (Because on ADVANCing we never
lose information, we just get closer to the leafs, so we don't skip
anything.) But then there is less chance of optimization. Does this
make sense? So what you said, that we can compare keys only on the
same level... we can always compare them, correct?

I will now study the rest of your patchset.

Thanks!
Alex.


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-05 12:19       ` Alex Lyakas
@ 2012-07-05 12:47         ` Alexander Block
  2012-07-05 13:04           ` Alex Lyakas
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Block @ 2012-07-05 12:47 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Thu, Jul 5, 2012 at 2:19 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Alexander,
>
>>>> +               if (advance_left && !left_end_reached) {
>>>> +                       ret = tree_advance(left_root, left_path, &left_level,
>>>> +                                       left_root_level,
>>>> +                                       advance_left != ADVANCE_ONLY_NEXT,
>>>> +                                       &left_key);
>>>> +                       if (ret < 0)
>>>> +                               left_end_reached = ADVANCE;
>>>> +                       advance_left = 0;
>>>> +               }
>>>> +               if (advance_right && !right_end_reached) {
>>>> +                       ret = tree_advance(right_root, right_path, &right_level,
>>>> +                                       right_root_level,
>>>> +                                       advance_right != ADVANCE_ONLY_NEXT,
>>>> +                                       &right_key);
>>>> +                       if (ret < 0)
>>>> +                               right_end_reached = ADVANCE;
>>>> +                       advance_right = 0;
>>>> +               }
>>> Do you think it's worth it to put a check/warning/smth before that,
>>> that either advance_right or advance_left is non-zero, or we have
>>> reached ends in both trees?
>>>
>>>
>> Having the left or right end reached before the other sides end is
>> reached is something that is completely normal and expected.
> What I meant was that when we start the loop, either advance_left!=0
> or advance_right!=0. So I thought it would be good to notice that.
> However, on the very first loop iteration, both of them are zero, so I
> was wrong.
>
>
>>>> +               } else if (left_level == right_level) {
>>> ...
>>>> +               } else if (left_level < right_level) {
>>>> +                       advance_right = ADVANCE;
>>>> +               } else {
>>>> +                       advance_left = ADVANCE;
>>>> +               }
>>> Can you pls explain why it is correct?
>>> Why if we are on lower level in the "newer" tree than we are in the
>>> "older" tree, we need to advance the "older" tree? I.e., why this
>>> implies that we are on the lower key in the "older" tree? (And
>>> vice-versa). I.e., how difference in levels indicates relation between
>>> keys?
>> Difference in levels has no relation to the keys. These advances
>> basically try to keep the two trees positions "in-sync". The compare
>> always tries to get both trees to a point where they are at the same
>> level, as only then we can compare keys. Also, the two trees may have
>> different root levels, this code also handles that case.
>
> Can you pls tell me if I understand your algorithm correctly:
> Basically, we need to get to the leaf levels and compare the items in
> the leafs. Only when we are on the leaf level, we can safely signal
> deletions and additions of items, not on upper levels.
> There is only one optimization: we want to find nodes that are shared,
> and such nodes can be only on the same level. To make this
> optimization happen, we try to always match the levels of the tree.
> This is the purpose of:
>                 } else if (left_level < right_level) {
>                         advance_right = ADVANCE;
>                 } else {
>                         advance_left = ADVANCE;
>                 }
>
Sounds like you understood it.
> Note: I think that instead of comparing levels, we could always
> compare keys and ADVANCE the lower key. (Because on ADVANCing we never
> lose information, we just get closer to leafs, so we don't skip
> anything.) But then there is less chance of optimization. Does this
> make sense? So what you said that we can compare keys only on the same
> level...we can always compare them, correct?
Hmm, I think I don't understand what you mean. When we are at level
0, advancing will in most cases mean that we only get to the next
(slot+1) item. Only when we are on the last item of a leaf do we
change levels. In that case, the first ADVANCE will go up as many
levels as needed until it's possible to go to the next node at that
level. After this, the next ADVANCEs will go down the tree again
until we are at the same level as the other tree.

Imagine this tree:
          R
      /   |   \
  A       B       C
 / \     / \     / \
D   E   F   G   H   I

It would iterate in this order:
R(slot 0)     -> down
A(slot 0)     -> down
D(slot 0..nr) -> upnext
A(slot 1)     -> down
E(slot 0..nr) -> upnext
R(slot 1)     -> down
B(slot 0)     -> down
F(slot 0..nr) -> upnext
B(slot 1)     -> down
G(slot 0..nr) -> upnext
R(slot 2)     -> down
C(slot 0)     -> down
H(slot 0..nr) -> upnext
C(slot 1)     -> down
I(slot 0..nr) -> upnext
done because upnext can't advance anymore.
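The advancing rule can be sketched in userspace with a fixed-fanout toy tree (hypothetical types, not the btrfs extent-buffer code); collecting the leaves in the order the loop reaches them yields D, E, F, G, H, I for the tree above:

```c
/* Sketch of "advancing" with a fixed-fanout toy tree (hypothetical
 * types, not the btrfs extent-buffer code): at level 0 visit the
 * node, then go up until a next slot exists, then descend again. */
#include <assert.h>

#define FANOUT 3

struct node {
	struct node *child[FANOUT];
	int nr;		/* number of children; 0 means leaf */
	char name;
};

/* collect leaf names in the order the advance loop reaches them */
static int walk_leaves(struct node *root, char *out)
{
	struct node *path[8];
	int slot[8];
	int level = 0, n = 0;

	path[0] = root;
	slot[0] = 0;
	while (1) {
		if (path[level]->nr == 0) {	/* leaf: visit it */
			out[n++] = path[level]->name;
			do {			/* go up to a next slot */
				if (level == 0)
					return n;	/* can't advance */
				level--;
				slot[level]++;
			} while (slot[level] >= path[level]->nr);
		}
		/* descend into the current slot */
		path[level + 1] = path[level]->child[slot[level]];
		slot[level + 1] = 0;
		level++;
	}
}
```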

>
> I will now study the rest of your patchset.
>
> Thanks!
> Alex.


* Re: [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function
  2012-07-05 12:47         ` Alexander Block
@ 2012-07-05 13:04           ` Alex Lyakas
  0 siblings, 0 replies; 43+ messages in thread
From: Alex Lyakas @ 2012-07-05 13:04 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On Thu, Jul 5, 2012 at 3:47 PM, Alexander Block <ablock84@googlemail.com> wrote:
> On Thu, Jul 5, 2012 at 2:19 PM, Alex Lyakas
> <alex.bolshoy.btrfs@gmail.com> wrote:
>> Alexander,
>>
>>>>> +               if (advance_left && !left_end_reached) {
>>>>> +                       ret = tree_advance(left_root, left_path, &left_level,
>>>>> +                                       left_root_level,
>>>>> +                                       advance_left != ADVANCE_ONLY_NEXT,
>>>>> +                                       &left_key);
>>>>> +                       if (ret < 0)
>>>>> +                               left_end_reached = ADVANCE;
>>>>> +                       advance_left = 0;
>>>>> +               }
>>>>> +               if (advance_right && !right_end_reached) {
>>>>> +                       ret = tree_advance(right_root, right_path, &right_level,
>>>>> +                                       right_root_level,
>>>>> +                                       advance_right != ADVANCE_ONLY_NEXT,
>>>>> +                                       &right_key);
>>>>> +                       if (ret < 0)
>>>>> +                               right_end_reached = ADVANCE;
>>>>> +                       advance_right = 0;
>>>>> +               }
>>>> Do you think it's worth it to put a check/warning/smth before that,
>>>> that either advance_right or advance_left is non-zero, or we have
>>>> reached ends in both trees?
>>>>
>>>>
>>> Having the left or right end reached before the other sides end is
>>> reached is something that is completely normal and expected.
>> What I meant was that when we start the loop, either advance_left!=0
>> or advance_right!=0. So I thought it would be good to notice that.
>> However, on the very first loop iteration, both of them are zero, so I
>> was wrong.
>>
>>
>>>>> +               } else if (left_level == right_level) {
>>>> ...
>>>>> +               } else if (left_level < right_level) {
>>>>> +                       advance_right = ADVANCE;
>>>>> +               } else {
>>>>> +                       advance_left = ADVANCE;
>>>>> +               }
>>>> Can you pls explain why it is correct?
>>>> Why if we are on lower level in the "newer" tree than we are in the
>>>> "older" tree, we need to advance the "older" tree? I.e., why this
>>>> implies that we are on the lower key in the "older" tree? (And
>>>> vice-versa). I.e., how difference in levels indicates relation between
>>>> keys?
>>> Difference in levels has no relation to the keys. These advances
>>> basically try to keep the two trees positions "in-sync". The compare
>>> always tries to get both trees to a point where they are at the same
>>> level, as only then we can compare keys. Also, the two trees may have
>>> different root levels, this code also handles that case.
>>
>> Can you pls tell me if I understand your algorithm correctly:
>> Basically, we need to get to the leaf levels and compare the items in
>> the leafs. Only when we are on the leaf level, we can safely signal
>> deletions and additions of items, not on upper levels.
>> There is only one optimization: we want to find nodes that are shared,
>> and such nodes can be only on the same level. To make this
>> optimization happen, we try to always match the levels of the tree.
>> This is the purpose of:
>>                 } else if (left_level < right_level) {
>>                         advance_right = ADVANCE;
>>                 } else {
>>                         advance_left = ADVANCE;
>>                 }
>>
> Sounds like you understood it.
>> Note: I think that instead of comparing levels, we could always
>> compare keys and ADVANCE the lower key. (Because on ADVANCing we never
>> lose information, we just get closer to leafs, so we don't skip
>> anything.) But then there is less chance of optimization. Does this
>> make sense? So what you said that we can compare keys only on the same
>> level...we can always compare them, correct?
> Hmm, I think I don't understand what you mean. When we are at level
> 0, advancing will in most cases mean that we only get to the next
> (slot+1) item. Only when we are on the last item of a leaf do we
> change levels. In that case, the first ADVANCE will go up as many
> levels as needed until it's possible to go to the next node at that
> level. After this, the next ADVANCEs will go down the tree again
> until we are at the same level as the other tree.
>
> Imagine this tree:
>           R
>       /   |   \
>   A       B       C
>  / \     / \     / \
> D   E   F   G   H   I
>
> It would iterate in this order:
> R(slot 0)     -> down
> A(slot 0)     -> down
> D(slot 0..nr) -> upnext
> A(slot 1)     -> down
> E(slot 0..nr) -> upnext
> R(slot 1)     -> down
> B(slot 0)     -> down
> F(slot 0..nr) -> upnext
> B(slot 1)     -> down
> G(slot 0..nr) -> upnext
> R(slot 2)     -> down
> C(slot 0)     -> down
> H(slot 0..nr) -> upnext
> C(slot 1)     -> down
> I(slot 0..nr) -> upnext
> done because upnext can't advance anymore.
>
Thanks for the nice ASCII graphics!

Yes, if one of the scan heads is on level 0 and the other one is on a
different level, then we need to ADVANCE the other one; we cannot
compare keys at this point. We need to get the other head down to
level 0, and then compare keys and items. I overlooked this case.

I meant the case when both heads are on different levels and neither
of them is on level 0... then we can actually ADVANCE either of them
(or both), because eventually we need to get to the leafs. But if we
want to have a chance of finding a shared block (and skipping some
leafs), it's better to ADVANCE the higher one.

Yes, this algorithm is pretty elegant...

Alex.


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-04 13:38 ` [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times Alexander Block
  2012-07-05 11:51   ` Alexander Block
@ 2012-07-05 17:08   ` Zach Brown
  2012-07-05 17:14     ` Alexander Block
  2012-07-16 14:56   ` Arne Jansen
  2 siblings, 1 reply; 43+ messages in thread
From: Zach Brown @ 2012-07-05 17:08 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

> +static long btrfs_ioctl_set_received_subvol(struct file *file,
> +					    void __user *arg)
> +{
> +	struct btrfs_ioctl_received_subvol_args *sa = NULL;

> +	ret = copy_to_user(arg, sa, sizeof(*sa));

> +struct btrfs_ioctl_received_subvol_args {
> +	char	uuid[BTRFS_UUID_SIZE];	/* in */
> +	__u64	stransid;		/* in */
> +	__u64	rtransid;		/* out */
> +	struct timespec stime;		/* in */
> +	struct timespec rtime;		/* out */
> +	__u64	reserved[16];
> +};

Careful, timespec will be different sizes in 32bit userspace and a 64bit
kernel.  I'd use btrfs_timespec to get a fixed size timespec and avoid
all the compat_timespec noise.  (I'd then also worry about padding and
might pack the struct... I always lose track of the best practice
across all archs.)

- z
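A fixed-layout timespec along these lines (a hypothetical sketch modeled on btrfs_timespec, not the actual patch) sidesteps the compat issue, since explicit-width fields give the same size and offsets for 32-bit and 64-bit userspace:

```c
/* A fixed-layout timespec (hypothetical sketch modeled on
 * btrfs_timespec, not the actual patch): explicit-width fields have
 * the same size and offsets in 32-bit and 64-bit userspace, so no
 * compat_timespec handling is needed for the ioctl struct. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fixed_timespec {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((__packed__));
```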


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 17:08   ` Zach Brown
@ 2012-07-05 17:14     ` Alexander Block
  2012-07-05 17:20       ` Zach Brown
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Block @ 2012-07-05 17:14 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-btrfs

On Thu, Jul 5, 2012 at 7:08 PM, Zach Brown <zab@zabbo.net> wrote:
>> +static long btrfs_ioctl_set_received_subvol(struct file *file,
>> +                                           void __user *arg)
>> +{
>> +       struct btrfs_ioctl_received_subvol_args *sa = NULL;
>
>
>> +       ret = copy_to_user(arg, sa, sizeof(*sa));
>
>
>> +struct btrfs_ioctl_received_subvol_args {
>> +       char    uuid[BTRFS_UUID_SIZE];  /* in */
>> +       __u64   stransid;               /* in */
>> +       __u64   rtransid;               /* out */
>> +       struct timespec stime;          /* in */
>> +       struct timespec rtime;          /* out */
>> +       __u64   reserved[16];
>> +};
>
>
> Careful, timespec will be different sizes in 32bit userspace and a 64bit
> kernel.  I'd use btrfs_timespec to get a fixed size timespec and avoid
> all the compat_timespec noise.  (I'd then also worry about padding and
> might pack the struct.. I always lose track of the best practice across
> all archs.)
Hmm, we currently don't include ctree.h in ioctl.h. Can I include it
there, or are there problems with that? As an alternative, I could
define my own struct for that.
>
> - z


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 17:14     ` Alexander Block
@ 2012-07-05 17:20       ` Zach Brown
  2012-07-05 18:33         ` Ilya Dryomov
  0 siblings, 1 reply; 43+ messages in thread
From: Zach Brown @ 2012-07-05 17:20 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On 07/05/2012 10:14 AM, Alexander Block wrote:
> On Thu, Jul 5, 2012 at 7:08 PM, Zach Brown<zab@zabbo.net>  wrote:
>>
>> Careful, timespec will be different sizes in 32bit userspace and a 64bit
>> kernel.  I'd use btrfs_timespec to get a fixed size timespec and avoid
>> all the compat_timespec noise.  (I'd then also worry about padding and
>> might pack the struct.. I always lose track of the best practice across
>> all archs.)

> Hmm we currently don't have ctree.h in ioctl.h. Can I include it there
> or are there problems with that? As an alternative I could define my
> own struct for that.

Hmm, yeah, it looks like ioctl.h is well isolated and doesn't really
have a precedent for pulling in format bits from the kernel
implementation.

I'd do as you suggested and just make its own ioctl_timespec with a
comment that it's duplicating other similar structures, to keep ioctl.h
from getting tangled up in the kernel-side includes.

- z


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 17:20       ` Zach Brown
@ 2012-07-05 18:33         ` Ilya Dryomov
  2012-07-05 18:37           ` Zach Brown
  0 siblings, 1 reply; 43+ messages in thread
From: Ilya Dryomov @ 2012-07-05 18:33 UTC (permalink / raw)
  To: Zach Brown; +Cc: Alexander Block, linux-btrfs

On Thu, Jul 05, 2012 at 10:20:16AM -0700, Zach Brown wrote:
> On 07/05/2012 10:14 AM, Alexander Block wrote:
> >On Thu, Jul 5, 2012 at 7:08 PM, Zach Brown<zab@zabbo.net>  wrote:
> >>
> >>Careful, timespec will be different sizes in 32bit userspace and a 64bit
> >>kernel.  I'd use btrfs_timespec to get a fixed size timespec and avoid
> >>all the compat_timespec noise.  (I'd then also worry about padding and
> >>might pack the struct.. I always lose track of the best practice across
> >>all archs.)
> 
> >Hmm we currently don't have ctree.h in ioctl.h. Can I include it there
> >or are there problems with that? As an alternative I could define my
> >own struct for that.
> 
> Hmm, yeah, it looks like ioctl.h is well isolated and doesn't really
> have a precedent for pulling in format bits from the kernel
> implementation.
> 
> I'd do as you suggested and just make its own ioctl_timespec with a
> comment that it's duplicating other similar structures, to keep ioctl.h
> from getting tangled up in the kernel-side includes.

This has been done for restriper, see struct btrfs_balance_args vs
struct btrfs_disk_balance_args.  You could do the same thing:

struct btrfs_ioctl_timespec {
	__u64 sec;
	__u32 nsec;
} __attribute__ ((__packed__));

and take endianness into account with the le{64,32}_to_cpu and
cpu_to_le{64,32} macros.

Thanks,

		Ilya


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 18:33         ` Ilya Dryomov
@ 2012-07-05 18:37           ` Zach Brown
  2012-07-05 18:59             ` Ilya Dryomov
  0 siblings, 1 reply; 43+ messages in thread
From: Zach Brown @ 2012-07-05 18:37 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alexander Block, linux-btrfs


> and take endianness into account with the le{64,32}_to_cpu and
> cpu_to_le{64,32} macros.

The kernel doesn't support system calls from userspace of a different
endianness, no worries there :)

- z


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 18:37           ` Zach Brown
@ 2012-07-05 18:59             ` Ilya Dryomov
  2012-07-05 19:01               ` Zach Brown
  0 siblings, 1 reply; 43+ messages in thread
From: Ilya Dryomov @ 2012-07-05 18:59 UTC (permalink / raw)
  To: Zach Brown; +Cc: Alexander Block, linux-btrfs

On Thu, Jul 05, 2012 at 11:37:40AM -0700, Zach Brown wrote:
> 
> >and take endianness into account with the le{64,32}_to_cpu and
> >cpu_to_le{64,32} macros.
> 
> The kernel doesn't support system calls from userspace of a different
> endianness, no worries there :)

What if you are on a big-endian machine with a big-endian kernel and
userspace?  Everything on-disk should be little-endian, so if you are
going to write stuff you got from userspace to disk, at some point you
have to make sure you are writing out bytes in the right order.

Alex already does that, so my remarks are moot ;)

+       root_item->stime.sec = cpu_to_le64(sa->stime.tv_sec);
+       root_item->stime.nsec = cpu_to_le64(sa->stime.tv_nsec);
+       root_item->rtime.sec = cpu_to_le64(sa->rtime.tv_sec);
+       root_item->rtime.nsec = cpu_to_le64(sa->rtime.tv_nsec);

Thanks,

		Ilya


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 18:59             ` Ilya Dryomov
@ 2012-07-05 19:01               ` Zach Brown
  2012-07-05 19:18                 ` Alexander Block
  0 siblings, 1 reply; 43+ messages in thread
From: Zach Brown @ 2012-07-05 19:01 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alexander Block, linux-btrfs

On 07/05/2012 11:59 AM, Ilya Dryomov wrote:

> What if you are on a big-endian machine with a big-endian kernel and
> userspace?  Everything on-disk should be little-endian, so if you are
> going to write stuff you got from userspace to disk, at some point you
> have to make sure you are writing out bytes in the right order.
>
> Alex already does that, so my remarks are moot ;)

Yeah, indeed, we were only talking about the ioctl interface crossing
the user<->kernel barrier :).

- z


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 19:01               ` Zach Brown
@ 2012-07-05 19:18                 ` Alexander Block
  2012-07-05 19:24                   ` Ilya Dryomov
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Block @ 2012-07-05 19:18 UTC (permalink / raw)
  To: Zach Brown; +Cc: Ilya Dryomov, linux-btrfs

On Thu, Jul 5, 2012 at 9:01 PM, Zach Brown <zab@zabbo.net> wrote:
> On 07/05/2012 11:59 AM, Ilya Dryomov wrote:
>
>> What if you are on a big-endian machine with a big-endian kernel and
>> userspace?  Everything on-disk should be little-endian, so if you are
>> going to write stuff you got from userspace to disk, at some point you
>> have to make sure you are writing out bytes in the right order.
>>
>> Alex already does that, so my remarks are moot ;)
>
>
> Yeah, indeed, we were only talking about the ioctl interface crossing
> the user<->kernel barrier :).
>
I decided not to use __leXX variables in the new btrfs_ioctl_timespec
structure because, like most ioctl arguments, they are currently in
CPU-dependent endianness. The kernel will then do the endianness
conversion, as it already did with struct timespec.
> - z


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 19:18                 ` Alexander Block
@ 2012-07-05 19:24                   ` Ilya Dryomov
  2012-07-05 19:43                     ` Alexander Block
  0 siblings, 1 reply; 43+ messages in thread
From: Ilya Dryomov @ 2012-07-05 19:24 UTC (permalink / raw)
  To: Alexander Block; +Cc: Zach Brown, linux-btrfs

On Thu, Jul 05, 2012 at 09:18:41PM +0200, Alexander Block wrote:
> On Thu, Jul 5, 2012 at 9:01 PM, Zach Brown <zab@zabbo.net> wrote:
> > On 07/05/2012 11:59 AM, Ilya Dryomov wrote:
> >
> >> What if you are on a big-endian machine with a big-endian kernel and
> >> userspace?  Everything on-disk should be little-endian, so if you are
> >> going to write stuff you got from userspace to disk, at some point you
> >> have to make sure you are writing out bytes in the right order.
> >>
> >> Alex already does that, so my remarks are moot ;)
> >
> >
> > Yeah, indeed, we were only talking about the ioctl interface crossing
> > the user<->kernel barrier :).
> >
> I decided not to use __leXX variables in the new btrfs_ioctl_timespec
> structure because, like most ioctl arguments, they are currently in
> CPU-dependent endianness. The kernel will then do the endianness
> conversion, as it already did with struct timespec.

That's exactly the point of adding btrfs_ioctl_timespec instead of just
copying the btrfs_timespec definition.

Thanks,

		Ilya


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-05 19:24                   ` Ilya Dryomov
@ 2012-07-05 19:43                     ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-05 19:43 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Zach Brown, linux-btrfs

On Thu, Jul 5, 2012 at 9:24 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Thu, Jul 05, 2012 at 09:18:41PM +0200, Alexander Block wrote:
>> On Thu, Jul 5, 2012 at 9:01 PM, Zach Brown <zab@zabbo.net> wrote:
>> > On 07/05/2012 11:59 AM, Ilya Dryomov wrote:
>> >
>> >> What if you are on a big-endian machine with a big-endian kernel and
>> >> userspace?  Everything on-disk should be little-endian, so if you are
>> >> going to write stuff you got from userspace to disk, at some point you
>> >> have to make sure you are writing out bytes in the right order.
>> >>
>> >> Alex already does that, so my remarks are moot ;)
>> >
>> >
>> > Yeah, indeed, we were only talking about the ioctl interface crossing
>> > the user<->kernel barrier :).
>> >
>> I decided not to use __leXX variables in the new btrfs_ioctl_timespec
>> structure because, like most ioctl arguments, they are currently in
>> CPU-dependent endianness. The kernel will then do the endianness
>> conversion, as it already did with struct timespec.
>
> That's exactly the point of adding btrfs_ioctl_timespec instead of just
> copying btrfs_timespec definition.
Pushed to kernel and user space repos.
>
> Thanks,
>
>                 Ilya


* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
@ 2012-07-10 15:26   ` Alex Lyakas
  2012-07-25 13:37     ` Alexander Block
  2012-07-23 11:16   ` Arne Jansen
  2012-07-23 15:17   ` Alex Lyakas
  2 siblings, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-10 15:26 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Alexander,
this focuses on area of sending file extents:

> +static int is_extent_unchanged(struct send_ctx *sctx,
> +                              struct btrfs_path *left_path,
> +                              struct btrfs_key *ekey)
> +{
> +       int ret = 0;
> +       struct btrfs_key key;
> +       struct btrfs_path *path = NULL;
> +       struct extent_buffer *eb;
> +       int slot;
> +       struct btrfs_key found_key;
> +       struct btrfs_file_extent_item *ei;
> +       u64 left_disknr;
> +       u64 right_disknr;
> +       u64 left_offset;
> +       u64 right_offset;
> +       u64 left_len;
> +       u64 right_len;
> +       u8 left_type;
> +       u8 right_type;
> +
> +       path = alloc_path_for_send();
> +       if (!path)
> +               return -ENOMEM;
> +
> +       eb = left_path->nodes[0];
> +       slot = left_path->slots[0];
> +
> +       ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
> +       left_type = btrfs_file_extent_type(eb, ei);
> +       left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
> +       left_len = btrfs_file_extent_num_bytes(eb, ei);
> +       left_offset = btrfs_file_extent_offset(eb, ei);
> +
> +       if (left_type != BTRFS_FILE_EXTENT_REG) {
> +               ret = 0;
> +               goto out;
> +       }
> +
> +       key.objectid = ekey->objectid;
> +       key.type = BTRFS_EXTENT_DATA_KEY;
> +       key.offset = ekey->offset;
> +
> +       while (1) {
> +               ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path,
> +                               0, 0);
> +               if (ret < 0)
> +                       goto out;
> +               if (ret) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
> +                               path->slots[0]);
> +               if (found_key.objectid != key.objectid ||
> +                   found_key.type != key.type) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +
> +               eb = path->nodes[0];
> +               slot = path->slots[0];
> +
> +               ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
> +               right_type = btrfs_file_extent_type(eb, ei);
> +               right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
> +               right_len = btrfs_file_extent_num_bytes(eb, ei);
> +               right_offset = btrfs_file_extent_offset(eb, ei);
> +               btrfs_release_path(path);
> +
> +               if (right_type != BTRFS_FILE_EXTENT_REG) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +
> +               if (left_disknr != right_disknr) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +
> +               key.offset = found_key.offset + right_len;
> +               if (key.offset >= ekey->offset + left_len) {
> +                       ret = 1;
> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       btrfs_free_path(path);
> +       return ret;
> +}
> +

Should we always treat a left extent with bytenr==0 as unchanged?
Because right now, it simply reads and sends the data of such an
extent, while bytenr==0 means "no data allocated here". Since we always
do send_truncate() afterwards, the file size will always be correct, so
we can just skip bytenr==0 extents.
Same is true for BTRFS_FILE_EXTENT_PREALLOC extents, I think. Those
also don't contain real data.
So something like:
if (left_disknr == 0 || left_type == BTRFS_FILE_EXTENT_PREALLOC) {
	ret = 1;
	goto out;
}
before we check for BTRFS_FILE_EXTENT_REG.

Now I have a question about the rest of the logic that decides that an
extent is unchanged. I understand that if we see the same extent (same
disk_bytenr) shared between parent_root and send_root, then it must
contain the same data, even in nodatacow mode, because on the first
write to such a shared extent, it is cow'ed even with nodatacow.

However, shouldn't we check btrfs_file_extent_offset(), to make sure
that both send_root and parent_root point at the same offset into the
extent from the same file offset? Because if the extent_offset values
are different, then the data of the file might be different, even
though we are talking about the same extent.

So I am thinking about something like:

- ekey.offset points at data at logical address
left_disknr+left_offset (logical address within CHUNK_ITEM address
space) for left_len bytes
- found_key.offset points at data at logical address
right_disknr+right_offset for right_len
- we know that found_key.offset <= ekey.offset

So we need to ensure that left_disknr==right_disknr and also:
right_disknr+right_offset + (ekey.offset - found_key.offset) ==
left_disknr+left_offset
or does this while loop somehow ensure this equation?

However, I must admit I don't fully understand the logic behind
deciding that an extent is unchanged. Can you please explain what this
tries to accomplish, and why it decides that the extent is unchanged here:
key.offset = found_key.offset + right_len;
if (key.offset >= ekey->offset + left_len) {
	ret = 1;
	goto out;
}

Also: when searching for the next extent, should we use
btrfs_file_extent_num_bytes() or btrfs_file_extent_disk_num_bytes()?
They are sometimes not equal, and I'm not sure at which offset the next
extent (if any) should be. What about holes in files? Then we will
have non-consecutive offsets.

Thanks,
Alex.


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-04 13:38 ` [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times Alexander Block
  2012-07-05 11:51   ` Alexander Block
  2012-07-05 17:08   ` Zach Brown
@ 2012-07-16 14:56   ` Arne Jansen
  2012-07-23 19:41     ` Alexander Block
  2 siblings, 1 reply; 43+ messages in thread
From: Arne Jansen @ 2012-07-16 14:56 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On 04.07.2012 15:38, Alexander Block wrote:
> This patch introduces uuids for subvolumes. Each
> subvolume has its own uuid. In case it was snapshotted,
> it also contains parent_uuid. In case it was received,
> it also contains received_uuid.
> 
> It also introduces subvolume ctime/otime/stime/rtime. The
> first two are comparable to the times found in inodes. otime
> is the origin/creation time and ctime is the change time.
> stime/rtime are only valid on received subvolumes.
> stime is the time of the subvolume when it was
> sent. rtime is the time of the subvolume when it was
> received.
> 
> Additionally to the times, we have a transid for each
> time. They are updated at the same place as the times.
> 
> btrfs receive uses stransid and rtransid to find out
> if a received subvolume changed in the meantime.
> 
> If an older kernel mounts a filesystem with the
> extended fields, all fields become invalid. The next
> mount with a new kernel will detect this and reset the
> fields.
> 
> Signed-off-by: Alexander Block <ablock84@googlemail.com>
> ---
>  fs/btrfs/ctree.h       |   43 ++++++++++++++++++++++
>  fs/btrfs/disk-io.c     |    2 +
>  fs/btrfs/inode.c       |    4 ++
>  fs/btrfs/ioctl.c       |   96 ++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/ioctl.h       |   13 +++++++
>  fs/btrfs/root-tree.c   |   92 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/transaction.c |   17 +++++++++
>  7 files changed, 258 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8cfde93..2bd5df8 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -709,6 +709,35 @@ struct btrfs_root_item {
>  	struct btrfs_disk_key drop_progress;
>  	u8 drop_level;
>  	u8 level;
> +
> +	/*
> +	 * The following fields appear after subvol_uuids+subvol_times
> +	 * were introduced.
> +	 */
> +
> +	/*
> +	 * This generation number is used to test if the new fields are valid
> +	 * and up to date while reading the root item. Everytime the root item
> +	 * is written out, the "generation" field is copied into this field. If
> +	 * anyone ever mounted the fs with an older kernel, we will have
> +	 * mismatching generation values here and thus must invalidate the
> +	 * new fields. See btrfs_update_root and btrfs_find_last_root for
> +	 * details.
> +	 * the offset of generation_v2 is also used as the start for the memset
> +	 * when invalidating the fields.
> +	 */
> +	__le64 generation_v2;
> +	u8 uuid[BTRFS_UUID_SIZE];
> +	u8 parent_uuid[BTRFS_UUID_SIZE];
> +	u8 received_uuid[BTRFS_UUID_SIZE];
> +	__le64 ctransid; /* updated when an inode changes */
> +	__le64 otransid; /* trans when created */
> +	__le64 stransid; /* trans when sent. non-zero for received subvol */
> +	__le64 rtransid; /* trans when received. non-zero for received subvol */
> +	struct btrfs_timespec ctime;
> +	struct btrfs_timespec otime;
> +	struct btrfs_timespec stime;
> +	struct btrfs_timespec rtime;
>  } __attribute__ ((__packed__));
>  
>  /*
> @@ -1416,6 +1445,8 @@ struct btrfs_root {
>  	dev_t anon_dev;
>  
>  	int force_cow;
> +
> +	spinlock_t root_times_lock;
>  };
>  
>  struct btrfs_ioctl_defrag_range_args {
> @@ -2189,6 +2220,16 @@ BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
>  BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
>  BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
>  			 last_snapshot, 64);
> +BTRFS_SETGET_STACK_FUNCS(root_generation_v2, struct btrfs_root_item,
> +			 generation_v2, 64);
> +BTRFS_SETGET_STACK_FUNCS(root_ctransid, struct btrfs_root_item,
> +			 ctransid, 64);
> +BTRFS_SETGET_STACK_FUNCS(root_otransid, struct btrfs_root_item,
> +			 otransid, 64);
> +BTRFS_SETGET_STACK_FUNCS(root_stransid, struct btrfs_root_item,
> +			 stransid, 64);
> +BTRFS_SETGET_STACK_FUNCS(root_rtransid, struct btrfs_root_item,
> +			 rtransid, 64);
>  
>  static inline bool btrfs_root_readonly(struct btrfs_root *root)
>  {
> @@ -2829,6 +2870,8 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
>  void btrfs_set_root_node(struct btrfs_root_item *item,
>  			 struct extent_buffer *node);
>  void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
> +void btrfs_update_root_times(struct btrfs_trans_handle *trans,
> +			     struct btrfs_root *root);
>  
>  /* dir-item.c */
>  int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7b845ff..d3b49ad 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1182,6 +1182,8 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>  	root->defrag_running = 0;
>  	root->root_key.objectid = objectid;
>  	root->anon_dev = 0;
> +
> +	spin_lock_init(&root->root_times_lock);
>  }
>  
>  static int __must_check find_and_setup_root(struct btrfs_root *tree_root,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 139be17..0f6a65d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2734,6 +2734,8 @@ noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
>  	 */
>  	if (!btrfs_is_free_space_inode(root, inode)
>  	    && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +		btrfs_update_root_times(trans, root);
> +
>  		ret = btrfs_delayed_update_inode(trans, root, inode);
>  		if (!ret)
>  			btrfs_set_inode_last_trans(trans, inode);
> @@ -4728,6 +4730,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>  	trace_btrfs_inode_new(inode);
>  	btrfs_set_inode_last_trans(trans, inode);
>  
> +	btrfs_update_root_times(trans, root);
> +
>  	return inode;
>  fail:
>  	if (dir)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7011871..8d258cb 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -41,6 +41,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/slab.h>
>  #include <linux/blkdev.h>
> +#include <linux/uuid.h>
>  #include "compat.h"
>  #include "ctree.h"
>  #include "disk-io.h"
> @@ -346,11 +347,13 @@ static noinline int create_subvol(struct btrfs_root *root,
>  	struct btrfs_root *new_root;
>  	struct dentry *parent = dentry->d_parent;
>  	struct inode *dir;
> +	struct timespec cur_time = CURRENT_TIME;
>  	int ret;
>  	int err;
>  	u64 objectid;
>  	u64 new_dirid = BTRFS_FIRST_FREE_OBJECTID;
>  	u64 index = 0;
> +	uuid_le new_uuid;
>  
>  	ret = btrfs_find_free_objectid(root->fs_info->tree_root, &objectid);
>  	if (ret)
> @@ -389,8 +392,9 @@ static noinline int create_subvol(struct btrfs_root *root,
>  			    BTRFS_UUID_SIZE);
>  	btrfs_mark_buffer_dirty(leaf);
>  
> +	memset(&root_item, 0, sizeof(root_item));
> +
>  	inode_item = &root_item.inode;
> -	memset(inode_item, 0, sizeof(*inode_item));
>  	inode_item->generation = cpu_to_le64(1);
>  	inode_item->size = cpu_to_le64(3);
>  	inode_item->nlink = cpu_to_le32(1);
> @@ -408,8 +412,15 @@ static noinline int create_subvol(struct btrfs_root *root,
>  	btrfs_set_root_used(&root_item, leaf->len);
>  	btrfs_set_root_last_snapshot(&root_item, 0);
>  
> -	memset(&root_item.drop_progress, 0, sizeof(root_item.drop_progress));
> -	root_item.drop_level = 0;
> +	btrfs_set_root_generation_v2(&root_item,
> +			btrfs_root_generation(&root_item));
> +	uuid_le_gen(&new_uuid);
> +	memcpy(root_item.uuid, new_uuid.b, BTRFS_UUID_SIZE);
> +	root_item.otime.sec = cpu_to_le64(cur_time.tv_sec);
> +	root_item.otime.nsec = cpu_to_le64(cur_time.tv_nsec);
> +	root_item.ctime = root_item.otime;
> +	btrfs_set_root_ctransid(&root_item, trans->transid);
> +	btrfs_set_root_otransid(&root_item, trans->transid);
>  
>  	btrfs_tree_unlock(leaf);
>  	free_extent_buffer(leaf);
> @@ -3395,6 +3406,83 @@ out:
>  	return ret;
>  }
>  
> +static long btrfs_ioctl_set_received_subvol(struct file *file,
> +					    void __user *arg)
> +{
> +	struct btrfs_ioctl_received_subvol_args *sa = NULL;
> +	struct inode *inode = fdentry(file)->d_inode;
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_root_item *root_item = &root->root_item;
> +	struct btrfs_trans_handle *trans;
> +	int ret = 0;
> +
> +	ret = mnt_want_write_file(file);
> +	if (ret < 0)
> +		return ret;
> +
> +	down_write(&root->fs_info->subvol_sem);
> +
> +	if (btrfs_ino(inode) != BTRFS_FIRST_FREE_OBJECTID) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (btrfs_root_readonly(root)) {
> +		ret = -EROFS;
> +		goto out;
> +	}
> +
> +	if (!inode_owner_or_capable(inode)) {
> +		ret = -EACCES;
> +		goto out;
> +	}
> +
> +	sa = memdup_user(arg, sizeof(*sa));
> +	if (IS_ERR(sa)) {
> +		ret = PTR_ERR(sa);
> +		sa = NULL;
> +		goto out;
> +	}
> +
> +	trans = btrfs_start_transaction(root, 1);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		trans = NULL;
> +		goto out;
> +	}
> +
> +	sa->rtransid = trans->transid;
> +	sa->rtime = CURRENT_TIME;
> +
> +	memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE);
> +	btrfs_set_root_stransid(root_item, sa->stransid);
> +	btrfs_set_root_rtransid(root_item, sa->rtransid);
> +	root_item->stime.sec = cpu_to_le64(sa->stime.tv_sec);
> +	root_item->stime.nsec = cpu_to_le64(sa->stime.tv_nsec);
> +	root_item->rtime.sec = cpu_to_le64(sa->rtime.tv_sec);
> +	root_item->rtime.nsec = cpu_to_le64(sa->rtime.tv_nsec);
> +
> +	ret = btrfs_update_root(trans, root->fs_info->tree_root,
> +				&root->root_key, &root->root_item);
> +	if (ret < 0) {
> +		goto out;

are you leaking a trans handle here?

> +	} else {
> +		ret = btrfs_commit_transaction(trans, root);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = copy_to_user(arg, sa, sizeof(*sa));
> +	if (ret)
> +		ret = -EFAULT;
> +
> +out:
> +	kfree(sa);
> +	up_write(&root->fs_info->subvol_sem);
> +	mnt_drop_write_file(file);
> +	return ret;
> +}
> +
>  long btrfs_ioctl(struct file *file, unsigned int
>  		cmd, unsigned long arg)
>  {
> @@ -3477,6 +3565,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_balance_ctl(root, arg);
>  	case BTRFS_IOC_BALANCE_PROGRESS:
>  		return btrfs_ioctl_balance_progress(root, argp);
> +	case BTRFS_IOC_SET_RECEIVED_SUBVOL:
> +		return btrfs_ioctl_set_received_subvol(file, argp);
>  	case BTRFS_IOC_GET_DEV_STATS:
>  		return btrfs_ioctl_get_dev_stats(root, argp, 0);
>  	case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> index e440aa6..c9e3fac 100644
> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> @@ -295,6 +295,15 @@ struct btrfs_ioctl_get_dev_stats {
>  	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
>  };
>  
> +struct btrfs_ioctl_received_subvol_args {
> +	char	uuid[BTRFS_UUID_SIZE];	/* in */
> +	__u64	stransid;		/* in */
> +	__u64	rtransid;		/* out */
> +	struct timespec stime;		/* in */
> +	struct timespec rtime;		/* out */
> +	__u64	reserved[16];

What is this reserved field used for? I don't see a mechanism that
could be used to signal that there is useful information here, other
than using a different ioctl.

> +};
> +
>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>  				   struct btrfs_ioctl_vol_args)
>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
> @@ -359,6 +368,10 @@ struct btrfs_ioctl_get_dev_stats {
>  					struct btrfs_ioctl_ino_path_args)
>  #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
>  					struct btrfs_ioctl_ino_path_args)
> +
> +#define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
> +				struct btrfs_ioctl_received_subvol_args)
> +
>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>  				      struct btrfs_ioctl_get_dev_stats)
>  #define BTRFS_IOC_GET_AND_RESET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 24fb8ce..17d638e 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -16,6 +16,7 @@
>   * Boston, MA 021110-1307, USA.
>   */
>  
> +#include <linux/uuid.h>
>  #include "ctree.h"
>  #include "transaction.h"
>  #include "disk-io.h"
> @@ -25,6 +26,9 @@
>   * lookup the root with the highest offset for a given objectid.  The key we do
>   * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
>   * on error.
> + * We also check if the root was once mounted with an older kernel. If we detect
> + * this, the new fields coming after 'level' get overwritten with zeros so to
> + * invalidate the fields.

... "This is detected by a mismatch of the 2 generation fields" ... or something
like that.

>   */
>  int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>  			struct btrfs_root_item *item, struct btrfs_key *key)
> @@ -35,6 +39,9 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>  	struct extent_buffer *l;
>  	int ret;
>  	int slot;
> +	int len;
> +	int need_reset = 0;
> +	uuid_le uuid;
>  
>  	search_key.objectid = objectid;
>  	search_key.type = BTRFS_ROOT_ITEM_KEY;
> @@ -60,11 +67,36 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>  		ret = 1;
>  		goto out;
>  	}
> -	if (item)
> +	if (item) {
> +		len = btrfs_item_size_nr(l, slot);
>  		read_extent_buffer(l, item, btrfs_item_ptr_offset(l, slot),
> -				   sizeof(*item));
> +				min_t(int, len, (int)sizeof(*item)));
> +		if (len < sizeof(*item))
> +			need_reset = 1;
> +		if (!need_reset && btrfs_root_generation(item)
> +			!= btrfs_root_generation_v2(item)) {
> +			if (btrfs_root_generation_v2(item) != 0) {
> +				printk(KERN_WARNING "btrfs: mismatching "
> +						"generation and generation_v2 "
> +						"found in root item. This root "
> +						"was probably mounted with an "
> +						"older kernel. Resetting all "
> +						"new fields.\n");
> +			}
> +			need_reset = 1;
> +		}
> +		if (need_reset) {
> +			memset(&item->generation_v2, 0,
> +				sizeof(*item) - offsetof(struct btrfs_root_item,
> +						generation_v2));
> +
> +			uuid_le_gen(&uuid);
> +			memcpy(item->uuid, uuid.b, BTRFS_UUID_SIZE);
> +		}
> +	}
>  	if (key)
>  		memcpy(key, &found_key, sizeof(found_key));
> +
>  	ret = 0;
>  out:
>  	btrfs_free_path(path);
> @@ -91,16 +123,15 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
>  	int ret;
>  	int slot;
>  	unsigned long ptr;
> +	int old_len;
>  
>  	path = btrfs_alloc_path();
>  	if (!path)
>  		return -ENOMEM;
>  
>  	ret = btrfs_search_slot(trans, root, key, path, 0, 1);
> -	if (ret < 0) {
> -		btrfs_abort_transaction(trans, root, ret);
> -		goto out;
> -	}
> +	if (ret < 0)
> +		goto out_abort;
>  
>  	if (ret != 0) {
>  		btrfs_print_leaf(root, path->nodes[0]);
> @@ -113,11 +144,47 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
>  	l = path->nodes[0];
>  	slot = path->slots[0];
>  	ptr = btrfs_item_ptr_offset(l, slot);
> +	old_len = btrfs_item_size_nr(l, slot);
> +
> +	/*
> +	 * If this is the first time we update the root item which originated
> +	 * from an older kernel, we need to enlarge the item size to make room
> +	 * for the added fields.
> +	 */
> +	if (old_len < sizeof(*item)) {
> +		btrfs_release_path(path);
> +		ret = btrfs_search_slot(trans, root, key, path,
> +				-1, 1);
> +		if (ret < 0)
> +			goto out_abort;
> +		ret = btrfs_del_item(trans, root, path);
> +		if (ret < 0)
> +			goto out_abort;
> +		btrfs_release_path(path);
> +		ret = btrfs_insert_empty_item(trans, root, path,
> +				key, sizeof(*item));
> +		if (ret < 0)
> +			goto out_abort;
> +		l = path->nodes[0];
> +		slot = path->slots[0];
> +		ptr = btrfs_item_ptr_offset(l, slot);
> +	}
> +
> +	/*
> +	 * Update generation_v2 so at the next mount we know the new root
> +	 * fields are valid.
> +	 */
> +	btrfs_set_root_generation_v2(item, btrfs_root_generation(item));
> +
>  	write_extent_buffer(l, item, ptr, sizeof(*item));
>  	btrfs_mark_buffer_dirty(path->nodes[0]);
>  out:
>  	btrfs_free_path(path);
>  	return ret;
> +
> +out_abort:
> +	btrfs_abort_transaction(trans, root, ret);
> +	goto out;
>  }
>  
>  int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> @@ -454,3 +521,16 @@ void btrfs_check_and_init_root_item(struct btrfs_root_item *root_item)
>  		root_item->byte_limit = 0;
>  	}
>  }
> +
> +void btrfs_update_root_times(struct btrfs_trans_handle *trans,
> +			     struct btrfs_root *root)
> +{
> +	struct btrfs_root_item *item = &root->root_item;
> +	struct timespec ct = CURRENT_TIME;
> +
> +	spin_lock(&root->root_times_lock);
> +	item->ctransid = trans->transid;
> +	item->ctime.sec = cpu_to_le64(ct.tv_sec);
> +	item->ctime.nsec = cpu_to_le64(ct.tv_nsec);
> +	spin_unlock(&root->root_times_lock);
> +}
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index b72b068..a21f308 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -22,6 +22,7 @@
>  #include <linux/writeback.h>
>  #include <linux/pagemap.h>
>  #include <linux/blkdev.h>
> +#include <linux/uuid.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -926,11 +927,13 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	struct dentry *dentry;
>  	struct extent_buffer *tmp;
>  	struct extent_buffer *old;
> +	struct timespec cur_time = CURRENT_TIME;
>  	int ret;
>  	u64 to_reserve = 0;
>  	u64 index = 0;
>  	u64 objectid;
>  	u64 root_flags;
> +	uuid_le new_uuid;
>  
>  	rsv = trans->block_rsv;
>  
> @@ -1016,6 +1019,20 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  		root_flags &= ~BTRFS_ROOT_SUBVOL_RDONLY;
>  	btrfs_set_root_flags(new_root_item, root_flags);
>  
> +	btrfs_set_root_generation_v2(new_root_item,
> +			trans->transid);
> +	uuid_le_gen(&new_uuid);
> +	memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
> +	memcpy(new_root_item->parent_uuid, root->root_item.uuid,
> +			BTRFS_UUID_SIZE);
> +	new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
> +	new_root_item->otime.nsec = cpu_to_le64(cur_time.tv_nsec);
> +	btrfs_set_root_otransid(new_root_item, trans->transid);
> +	memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
> +	memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
> +	btrfs_set_root_stransid(new_root_item, 0);
> +	btrfs_set_root_rtransid(new_root_item, 0);
> +
>  	old = btrfs_lock_root_node(root);
>  	ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
>  	if (ret) {


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1)
  2012-07-04 13:38 ` [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1) Alexander Block
@ 2012-07-18  6:59   ` Arne Jansen
  2012-07-25 17:33     ` Alexander Block
  2012-07-21 10:53   ` Arne Jansen
  1 sibling, 1 reply; 43+ messages in thread
From: Arne Jansen @ 2012-07-18  6:59 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On 04.07.2012 15:38, Alexander Block wrote:
> This patch introduces the BTRFS_IOC_SEND ioctl that is
> required for send. It allows btrfs-progs to implement
> full and incremental sends. Patches for btrfs-progs will
> follow.
> 
> I had to split the patch as it got larger than 100k, which is

> the limit for the mailing list. The first part only contains
> the send.h header and the helper functions for TLV handling
> and long path name handling and some other helpers. The second
> part contains the actual send logic from send.c
> 
> Signed-off-by: Alexander Block <ablock84@googlemail.com>
> ---
>  fs/btrfs/Makefile |    2 +-
>  fs/btrfs/ioctl.h  |   10 +
>  fs/btrfs/send.c   | 1009 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/send.h   |  126 +++++++
>  4 files changed, 1146 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/send.c
>  create mode 100644 fs/btrfs/send.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 0c4fa2b..f740644 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
>  	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
>  	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
> -	   reada.o backref.o ulist.o
> +	   reada.o backref.o ulist.o send.o
>  
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> index c9e3fac..282bc64 100644
> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> @@ -304,6 +304,15 @@ struct btrfs_ioctl_received_subvol_args {
>  	__u64	reserved[16];
>  };
>  
> +struct btrfs_ioctl_send_args {
> +	__s64 send_fd;			/* in */
> +	__u64 clone_sources_count;	/* in */
> +	__u64 __user *clone_sources;	/* in */
> +	__u64 parent_root;		/* in */
> +	__u64 flags;			/* in */
> +	__u64 reserved[4];		/* in */
> +};
> +
>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>  				   struct btrfs_ioctl_vol_args)
>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
> @@ -371,6 +380,7 @@ struct btrfs_ioctl_received_subvol_args {
>  
>  #define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>  				struct btrfs_ioctl_received_subvol_args)
> +#define BTRFS_IOC_SEND _IOW(BTRFS_IOCTL_MAGIC, 38, struct btrfs_ioctl_send_args)
>  
>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>  				      struct btrfs_ioctl_get_dev_stats)
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> new file mode 100644
> index 0000000..47a2557
> --- /dev/null
> +++ b/fs/btrfs/send.c
> @@ -0,0 +1,1009 @@
> +/*
> + * Copyright (C) 2012 Alexander Block.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include <linux/bsearch.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/sort.h>
> +#include <linux/mount.h>
> +#include <linux/xattr.h>
> +#include <linux/posix_acl_xattr.h>
> +#include <linux/radix-tree.h>
> +#include <linux/crc32c.h>
> +
> +#include "send.h"
> +#include "backref.h"
> +#include "locking.h"
> +#include "disk-io.h"
> +#include "btrfs_inode.h"
> +#include "transaction.h"
> +
> +static int g_verbose = 0;
> +
> +#define verbose_printk(...) if (g_verbose) printk(__VA_ARGS__)

Maybe pr_debug() would be a better fit here.

> +
> +/*
> + * A fs_path is a helper to dynamically build path names with unknown size.
> + * It reallocates the internal buffer on demand.
> + * It allows fast adding of path elements on the right side (normal path) and
> + * fast adding to the left side (reversed path). A reversed path can also be
> + * unreversed if needed.
> + */
> +struct fs_path {
> +	union {
> +		struct {
> +			char *start;
> +			char *end;
> +			char *prepared;
> +
> +			char *buf;
> +			int buf_len;
> +			int reversed:1;
> +			int virtual_mem:1;

s/int/unsigned int/

> +			char inline_buf[];
> +		};
> +		char pad[PAGE_SIZE];
> +	};
> +};
> +#define FS_PATH_INLINE_SIZE \
> +	(sizeof(struct fs_path) - offsetof(struct fs_path, inline_buf))
> +
> +
> +/* reused for each extent */
> +struct clone_root {
> +	struct btrfs_root *root;
> +	u64 ino;
> +	u64 offset;
> +
> +	u64 found_refs;
> +};
> +
> +#define SEND_CTX_MAX_NAME_CACHE_SIZE 128
> +#define SEND_CTX_NAME_CACHE_CLEAN_SIZE (SEND_CTX_MAX_NAME_CACHE_SIZE * 2)
> +
> +struct send_ctx {
> +	struct file *send_filp;
> +	loff_t send_off;
> +	char *send_buf;
> +	u32 send_size;
> +	u32 send_max_size;
> +	u64 total_send_size;
> +	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1];
> +
> +	struct vfsmount *mnt;
> +
> +	struct btrfs_root *send_root;
> +	struct btrfs_root *parent_root;
> +	struct clone_root *clone_roots;
> +	int clone_roots_cnt;
> +
> +	/* current state of the compare_tree call */
> +	struct btrfs_path *left_path;
> +	struct btrfs_path *right_path;
> +	struct btrfs_key *cmp_key;
> +
> +	/*
> +	 * info about the currently processed inode. In case of deleted inodes,
> +	 * these are the values from the deleted inode.
> +	 */
> +	u64 cur_ino;
> +	u64 cur_inode_gen;
> +	int cur_inode_new;
> +	int cur_inode_new_gen;
> +	int cur_inode_deleted;
> +	u64 cur_inode_size;
> +	u64 cur_inode_mode;
> +
> +	u64 send_progress;
> +
> +	struct list_head new_refs;
> +	struct list_head deleted_refs;
> +
> +	struct radix_tree_root name_cache;
> +	struct list_head name_cache_list;
> +	int name_cache_size;
> +
> +	struct file *cur_inode_filp;
> +	char *read_buf;
> +};
> +
> +struct name_cache_entry {
> +	struct list_head list;
> +	struct list_head use_list;
> +	u64 ino;
> +	u64 gen;
> +	u64 parent_ino;
> +	u64 parent_gen;
> +	int ret;
> +	int need_later_update;
> +	int name_len;
> +	char name[];
> +};
> +
> +static void fs_path_reset(struct fs_path *p)
> +{
> +	if (p->reversed) {
> +		p->start = p->buf + p->buf_len - 1;
> +		p->end = p->start;
> +		*p->start = 0;
> +	} else {
> +		p->start = p->buf;
> +		p->end = p->start;
> +		*p->start = 0;
> +	}
> +}
> +
> +static struct fs_path *fs_path_alloc(struct send_ctx *sctx)

parameter unused.

> +{
> +	struct fs_path *p;
> +
> +	p = kmalloc(sizeof(*p), GFP_NOFS);
> +	if (!p)
> +		return NULL;
> +	p->reversed = 0;
> +	p->virtual_mem = 0;
> +	p->buf = p->inline_buf;
> +	p->buf_len = FS_PATH_INLINE_SIZE;
> +	fs_path_reset(p);
> +	return p;
> +}
> +
> +static struct fs_path *fs_path_alloc_reversed(struct send_ctx *sctx)

ditto.

> +{
> +	struct fs_path *p;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return NULL;
> +	p->reversed = 1;
> +	fs_path_reset(p);
> +	return p;
> +}
> +
> +static void fs_path_free(struct send_ctx *sctx, struct fs_path *p)

ditto, sctx unused.

> +{
> +	if (!p)
> +		return;
> +	if (p->buf != p->inline_buf) {
> +		if (p->virtual_mem)
> +			vfree(p->buf);
> +		else
> +			kfree(p->buf);
> +	}
> +	kfree(p);
> +}
> +
> +static int fs_path_len(struct fs_path *p)
> +{
> +	return p->end - p->start;
> +}
> +
> +static int fs_path_ensure_buf(struct fs_path *p, int len)
> +{
> +	char *tmp_buf;
> +	int path_len;
> +	int old_buf_len;
> +
> +	len++;

This increment looks a bit unmotivated; what is it for? If it accounts
for the 0-termination, it might be clearer to add it at the calling
site.

> +
> +	if (p->buf_len >= len)
> +		return 0;
> +
> +	path_len = p->end - p->start;
> +	old_buf_len = p->buf_len;
> +	len = PAGE_ALIGN(len);
> +
> +	if (p->buf == p->inline_buf) {
> +		tmp_buf = kmalloc(len, GFP_NOFS);
> +		if (!tmp_buf) {
> +			tmp_buf = vmalloc(len);

have you tested this path?

> +			if (!tmp_buf)
> +				return -ENOMEM;
> +			p->virtual_mem = 1;
> +		}
> +		memcpy(tmp_buf, p->buf, p->buf_len);
> +		p->buf = tmp_buf;
> +		p->buf_len = len;
> +	} else {
> +		if (p->virtual_mem) {
> +			tmp_buf = vmalloc(len);
> +			if (!tmp_buf)
> +				return -ENOMEM;
> +			memcpy(tmp_buf, p->buf, p->buf_len);
> +			vfree(p->buf);
> +		} else {
> +			tmp_buf = krealloc(p->buf, len, GFP_NOFS);
> +			if (!tmp_buf) {
> +				tmp_buf = vmalloc(len);
> +				if (!tmp_buf)
> +					return -ENOMEM;
> +				memcpy(tmp_buf, p->buf, p->buf_len);
> +				kfree(p->buf);
> +				p->virtual_mem = 1;
> +			}
> +		}
> +		p->buf = tmp_buf;
> +		p->buf_len = len;
> +	}
> +	if (p->reversed) {
> +		tmp_buf = p->buf + old_buf_len - path_len - 1;
> +		p->end = p->buf + p->buf_len - 1;
> +		p->start = p->end - path_len;
> +		memmove(p->start, tmp_buf, path_len + 1);

First you copy it, then you move it again? There's room for optimization
here ;)

> +	} else {
> +		p->start = p->buf;
> +		p->end = p->start + path_len;
> +	}
> +	return 0;
> +}
> +
> +static int fs_path_prepare_for_add(struct fs_path *p, int name_len)
> +{
> +	int ret;
> +	int new_len;
> +
> +	new_len = p->end - p->start + name_len;
> +	if (p->start != p->end)
> +		new_len++;
> +	ret = fs_path_ensure_buf(p, new_len);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (p->reversed) {
> +		if (p->start != p->end)
> +			*--p->start = '/';
> +		p->start -= name_len;
> +		p->prepared = p->start;
> +	} else {
> +		if (p->start != p->end)
> +			*p->end++ = '/';
> +		p->prepared = p->end;
> +		p->end += name_len;
> +		*p->end = 0;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int fs_path_add(struct fs_path *p, char *name, int name_len)
> +{
> +	int ret;
> +
> +	ret = fs_path_prepare_for_add(p, name_len);
> +	if (ret < 0)
> +		goto out;
> +	memcpy(p->prepared, name, name_len);
> +	p->prepared = NULL;
> +
> +out:
> +	return ret;
> +}
> +
> +static int fs_path_add_path(struct fs_path *p, struct fs_path *p2)
> +{
> +	int ret;
> +
> +	ret = fs_path_prepare_for_add(p, p2->end - p2->start);
> +	if (ret < 0)
> +		goto out;
> +	memcpy(p->prepared, p2->start, p2->end - p2->start);
> +	p->prepared = NULL;
> +
> +out:
> +	return ret;
> +}
> +
> +static int fs_path_add_from_extent_buffer(struct fs_path *p,
> +					  struct extent_buffer *eb,
> +					  unsigned long off, int len)
> +{
> +	int ret;
> +
> +	ret = fs_path_prepare_for_add(p, len);
> +	if (ret < 0)
> +		goto out;
> +
> +	read_extent_buffer(eb, p->prepared, off, len);
> +	p->prepared = NULL;
> +
> +out:
> +	return ret;
> +}
> +
> +static int fs_path_copy(struct fs_path *p, struct fs_path *from)
> +{
> +	int ret;
> +
> +	p->reversed = from->reversed;
> +	fs_path_reset(p);
> +
> +	ret = fs_path_add_path(p, from);
> +
> +	return ret;
> +}
> +
> +
> +static void fs_path_unreverse(struct fs_path *p)
> +{
> +	char *tmp;
> +	int len;
> +
> +	if (!p->reversed)
> +		return;
> +
> +	tmp = p->start;
> +	len = p->end - p->start;
> +	p->start = p->buf;
> +	p->end = p->start + len;
> +	memmove(p->start, tmp, len + 1);
> +	p->reversed = 0;

oh, so 'reversed' doesn't mean the path is stored reversed, but only
that components are prepended? This function doesn't actually reverse
anything; it just moves the string to the front of the buffer. So
something with 'prepend' in it might be a better name than 'reverse'
for all occurrences of 'reverse' above.

> +}
> +
> +static struct btrfs_path *alloc_path_for_send(void)
> +{
> +	struct btrfs_path *path;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return NULL;
> +	path->search_commit_root = 1;
> +	path->skip_locking = 1;
> +	return path;
> +}
> +
> +static int write_buf(struct send_ctx *sctx, const void *buf, u32 len)
> +{
> +	int ret;
> +	mm_segment_t old_fs;
> +	u32 pos = 0;
> +
> +	old_fs = get_fs();
> +	set_fs(KERNEL_DS);
> +
> +	while (pos < len) {
> +		ret = vfs_write(sctx->send_filp, (char *)buf + pos, len - pos,
> +				&sctx->send_off);
> +		/* TODO handle that correctly */
> +		/*if (ret == -ERESTARTSYS) {
> +			continue;
> +		}*/

I prefer #if 0 over comments for disabling code, but I don't know
what the style guide says about that.

> +		if (ret < 0) {
> +			printk("%d\n", ret);

This is not the most verbose error message of all :)

> +			goto out;
> +		}
> +		if (ret == 0) {
> +			ret = -EIO;
> +			goto out;
> +		}
> +		pos += ret;
> +	}
> +
> +	ret = 0;
> +
> +out:
> +	set_fs(old_fs);
> +	return ret;
> +}
> +
> +static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
> +{
> +	struct btrfs_tlv_header *hdr;
> +	int total_len = sizeof(*hdr) + len;
> +	int left = sctx->send_max_size - sctx->send_size;
> +
> +	if (unlikely(left < total_len))
> +		return -EOVERFLOW;
> +
> +	hdr = (struct btrfs_tlv_header *) (sctx->send_buf + sctx->send_size);
> +	hdr->tlv_type = cpu_to_le16(attr);
> +	hdr->tlv_len = cpu_to_le16(len);

you might want to check len for overflow here: tlv_len is only 16
bits, so a larger len would be silently truncated

> +	memcpy(hdr + 1, data, len);
> +	sctx->send_size += total_len;
> +
> +	return 0;
> +}
> +
> +#if 0
> +static int tlv_put_u8(struct send_ctx *sctx, u16 attr, u8 value)
> +{
> +	return tlv_put(sctx, attr, &value, sizeof(value));
> +}
> +
> +static int tlv_put_u16(struct send_ctx *sctx, u16 attr, u16 v)
> +{
> +	__le16 tmp = cpu_to_le16(value);

s/value/v

> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
> +}
> +
> +static int tlv_put_u32(struct send_ctx *sctx, u16 attr, u32 value)
> +{
> +	__le32 tmp = cpu_to_le32(value);
> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
> +}
> +#endif
> +
> +static int tlv_put_u64(struct send_ctx *sctx, u16 attr, u64 value)
> +{
> +	__le64 tmp = cpu_to_le64(value);
> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
> +}
> +
> +static int tlv_put_string(struct send_ctx *sctx, u16 attr,
> +			  const char *str, int len)
> +{
> +	if (len == -1)
> +		len = strlen(str);
> +	return tlv_put(sctx, attr, str, len);
> +}
> +
> +static int tlv_put_uuid(struct send_ctx *sctx, u16 attr,
> +			const u8 *uuid)
> +{
> +	return tlv_put(sctx, attr, uuid, BTRFS_UUID_SIZE);
> +}
> +
> +#if 0
> +static int tlv_put_timespec(struct send_ctx *sctx, u16 attr,
> +			    struct timespec *ts)
> +{
> +	struct btrfs_timespec bts;
> +	bts.sec = cpu_to_le64(ts->tv_sec);
> +	bts.nsec = cpu_to_le32(ts->tv_nsec);
> +	return tlv_put(sctx, attr, &bts, sizeof(bts));
> +}
> +#endif
> +
> +static int tlv_put_btrfs_timespec(struct send_ctx *sctx, u16 attr,
> +				  struct extent_buffer *eb,
> +				  struct btrfs_timespec *ts)
> +{
> +	struct btrfs_timespec bts;
> +	read_extent_buffer(eb, &bts, (unsigned long)ts, sizeof(bts));
> +	return tlv_put(sctx, attr, &bts, sizeof(bts));
> +}
> +
> +
> +#define TLV_PUT(sctx, attrtype, attrlen, data) \
> +	do { \
> +		ret = tlv_put(sctx, attrtype, attrlen, data); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +
> +#define TLV_PUT_INT(sctx, attrtype, bits, value) \
> +	do { \
> +		ret = tlv_put_u##bits(sctx, attrtype, value); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +
> +#define TLV_PUT_U8(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 8, data)
> +#define TLV_PUT_U16(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 16, data)
> +#define TLV_PUT_U32(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 32, data)
> +#define TLV_PUT_U64(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 64, data)
> +#define TLV_PUT_STRING(sctx, attrtype, str, len) \
> +	do { \
> +		ret = tlv_put_string(sctx, attrtype, str, len); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +#define TLV_PUT_PATH(sctx, attrtype, p) \
> +	do { \
> +		ret = tlv_put_string(sctx, attrtype, p->start, \
> +			p->end - p->start); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while(0)
> +#define TLV_PUT_UUID(sctx, attrtype, uuid) \
> +	do { \
> +		ret = tlv_put_uuid(sctx, attrtype, uuid); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +#define TLV_PUT_TIMESPEC(sctx, attrtype, ts) \
> +	do { \
> +		ret = tlv_put_timespec(sctx, attrtype, ts); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +#define TLV_PUT_BTRFS_TIMESPEC(sctx, attrtype, eb, ts) \
> +	do { \
> +		ret = tlv_put_btrfs_timespec(sctx, attrtype, eb, ts); \
> +		if (ret < 0) \
> +			goto tlv_put_failure; \
> +	} while (0)
> +
> +static int send_header(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_stream_header hdr;
> +
> +	strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC);
> +	hdr.version = cpu_to_le32(BTRFS_SEND_STREAM_VERSION);
> +
> +	ret = write_buf(sctx, &hdr, sizeof(hdr));
> +
> +	return ret;

(just return write_buf)

> +}
> +
> +/*
> + * For each command/item we want to send to userspace, we call this function.
> + */
> +static int begin_cmd(struct send_ctx *sctx, int cmd)
> +{
> +	int ret = 0;
> +	struct btrfs_cmd_header *hdr;
> +
> +	if (!sctx->send_buf) {
> +		WARN_ON(1);
> +		return -EINVAL;
> +	}
> +
> +	BUG_ON(!sctx->send_buf);

that's kind of redundant.

> +	BUG_ON(sctx->send_size);
> +
> +	sctx->send_size += sizeof(*hdr);
> +	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
> +	hdr->cmd = cpu_to_le16(cmd);
> +
> +	return ret;

ret is untouched here

> +}
> +
> +static int send_cmd(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_cmd_header *hdr;
> +	u32 crc;
> +
> +	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
> +	hdr->len = cpu_to_le32(sctx->send_size - sizeof(*hdr));
> +	hdr->crc = 0;
> +
> +	crc = crc32c(0, (unsigned char *)sctx->send_buf, sctx->send_size);
> +	hdr->crc = cpu_to_le32(crc);
> +
> +	ret = write_buf(sctx, sctx->send_buf, sctx->send_size);
> +
> +	sctx->total_send_size += sctx->send_size;
> +	sctx->cmd_send_size[le16_to_cpu(hdr->cmd)] += sctx->send_size;
> +	sctx->send_size = 0;
> +
> +	return ret;
> +}
> +
> +/*
> + * Sends a move instruction to user space
> + */
> +static int send_rename(struct send_ctx *sctx,
> +		     struct fs_path *from, struct fs_path *to)
> +{
> +	int ret;
> +
> +verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, from);
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_TO, to);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Sends a link instruction to user space
> + */
> +static int send_link(struct send_ctx *sctx,
> +		     struct fs_path *path, struct fs_path *lnk)
> +{
> +	int ret;
> +
> +verbose_printk("btrfs: send_link %s -> %s\n", path->start, lnk->start);
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, lnk);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Sends an unlink instruction to user space
> + */
> +static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
> +{
> +	int ret;
> +
> +verbose_printk("btrfs: send_unlink %s\n", path->start);
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Sends a rmdir instruction to user space
> + */
> +static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
> +{
> +	int ret;
> +
> +verbose_printk("btrfs: send_rmdir %s\n", path->start);
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Helper function to retrieve some fields from an inode item.
> + */
> +static int get_inode_info(struct btrfs_root *root,
> +			  u64 ino, u64 *size, u64 *gen,
> +			  u64 *mode, u64 *uid, u64 *gid)
> +{
> +	int ret;
> +	struct btrfs_inode_item *ii;
> +	struct btrfs_key key;
> +	struct btrfs_path *path;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_INODE_ITEM_KEY;
> +	key.offset = 0;
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	ii = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +			struct btrfs_inode_item);
> +	if (size)
> +		*size = btrfs_inode_size(path->nodes[0], ii);
> +	if (gen)
> +		*gen = btrfs_inode_generation(path->nodes[0], ii);
> +	if (mode)
> +		*mode = btrfs_inode_mode(path->nodes[0], ii);
> +	if (uid)
> +		*uid = btrfs_inode_uid(path->nodes[0], ii);
> +	if (gid)
> +		*gid = btrfs_inode_gid(path->nodes[0], ii);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +typedef int (*iterate_inode_ref_t)(int num, u64 dir, int index,
> +				   struct fs_path *p,
> +				   void *ctx);
> +
> +/*
> + * Helper function to iterate the entries in ONE btrfs_inode_ref.
> + * The iterate callback may return a non zero value to stop iteration. This can
> + * be a negative value for error codes or 1 to simply stop it.
> + *
> + * path must point to the INODE_REF when called.
> + */
> +static int iterate_inode_ref(struct send_ctx *sctx,
> +			     struct btrfs_root *root, struct btrfs_path *path,
> +			     struct btrfs_key *found_key, int resolve,
> +			     iterate_inode_ref_t iterate, void *ctx)
> +{
> +	struct extent_buffer *eb;
> +	struct btrfs_item *item;
> +	struct btrfs_inode_ref *iref;
> +	struct btrfs_path *tmp_path;
> +	struct fs_path *p;
> +	u32 cur;
> +	u32 len;
> +	u32 total;
> +	int slot;
> +	u32 name_len;
> +	char *start;
> +	int ret = 0;
> +	int num;
> +	int index;
> +
> +	p = fs_path_alloc_reversed(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	tmp_path = alloc_path_for_send();
> +	if (!tmp_path) {
> +		fs_path_free(sctx, p);
> +		return -ENOMEM;
> +	}
> +
> +	eb = path->nodes[0];
> +	slot = path->slots[0];
> +	item = btrfs_item_nr(eb, slot);
> +	iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref);
> +	cur = 0;
> +	len = 0;
> +	total = btrfs_item_size(eb, item);
> +
> +	num = 0;
> +	while (cur < total) {
> +		fs_path_reset(p);
> +
> +		name_len = btrfs_inode_ref_name_len(eb, iref);
> +		index = btrfs_inode_ref_index(eb, iref);
> +		if (resolve) {
> +			start = btrfs_iref_to_path(root, tmp_path, iref, eb,
> +						found_key->offset, p->buf,
> +						p->buf_len);

it might be worth it to build a better integration between
iref_to_path and your fs_path data structure. Maybe iref_to_path
can make direct use of fs_path.

> +			if (IS_ERR(start)) {
> +				ret = PTR_ERR(start);
> +				goto out;
> +			}
> +			if (start < p->buf) {
> +				/* overflow, try again with a larger buffer */
> +				ret = fs_path_ensure_buf(p,
> +						p->buf_len + p->buf - start);
> +				if (ret < 0)
> +					goto out;
> +				start = btrfs_iref_to_path(root, tmp_path, iref,
> +						eb, found_key->offset, p->buf,
> +						p->buf_len);
> +				if (IS_ERR(start)) {
> +					ret = PTR_ERR(start);
> +					goto out;
> +				}
> +				BUG_ON(start < p->buf);
> +			}
> +			p->start = start;
> +		} else {
> +			ret = fs_path_add_from_extent_buffer(p, eb,
> +					(unsigned long)(iref + 1), name_len);
> +			if (ret < 0)
> +				goto out;
> +		}
> +
> +
> +		len = sizeof(*iref) + name_len;
> +		iref = (struct btrfs_inode_ref *)((char *)iref + len);
> +		cur += len;
> +
> +		ret = iterate(num, found_key->offset, index, p, ctx);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;

wouldn't it make sense to pass this information on to the caller?

> +			goto out;
> +		}
> +
> +		num++;
> +	}
> +
> +out:
> +	btrfs_free_path(tmp_path);
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +typedef int (*iterate_dir_item_t)(int num, const char *name, int name_len,
> +				  const char *data, int data_len,
> +				  u8 type, void *ctx);
> +
> +/*
> + * Helper function to iterate the entries in ONE btrfs_dir_item.
> + * The iterate callback may return a non zero value to stop iteration. This can
> + * be a negative value for error codes or 1 to simply stop it.
> + *
> + * path must point to the dir item when called.
> + */
> +static int iterate_dir_item(struct send_ctx *sctx,
> +			    struct btrfs_root *root, struct btrfs_path *path,
> +			    struct btrfs_key *found_key,
> +			    iterate_dir_item_t iterate, void *ctx)
> +{
> +	int ret = 0;
> +	struct extent_buffer *eb;
> +	struct btrfs_item *item;
> +	struct btrfs_dir_item *di;
> +	struct btrfs_path *tmp_path = NULL;
> +	char *buf = NULL;
> +	char *buf2 = NULL;
> +	int buf_len;
> +	int buf_virtual = 0;
> +	u32 name_len;
> +	u32 data_len;
> +	u32 cur;
> +	u32 len;
> +	u32 total;
> +	int slot;
> +	int num;
> +	u8 type;
> +
> +	buf_len = PAGE_SIZE;
> +	buf = kmalloc(buf_len, GFP_NOFS);
> +	if (!buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	tmp_path = alloc_path_for_send();
> +	if (!tmp_path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	eb = path->nodes[0];
> +	slot = path->slots[0];
> +	item = btrfs_item_nr(eb, slot);
> +	di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item);
> +	cur = 0;
> +	len = 0;
> +	total = btrfs_item_size(eb, item);
> +
> +	num = 0;
> +	while (cur < total) {
> +		name_len = btrfs_dir_name_len(eb, di);
> +		data_len = btrfs_dir_data_len(eb, di);
> +		type = btrfs_dir_type(eb, di);
> +
> +		if (name_len + data_len > buf_len) {
> +			buf_len = PAGE_ALIGN(name_len + data_len);
> +			if (buf_virtual) {
> +				buf2 = vmalloc(buf_len);
> +				if (!buf2) {
> +					ret = -ENOMEM;
> +					goto out;
> +				}
> +				vfree(buf);
> +			} else {
> +				buf2 = krealloc(buf, buf_len, GFP_NOFS);
> +				if (!buf2) {
> +					buf2 = vmalloc(buf_len);
> +					if (!buf) {

!buf2

> +						ret = -ENOMEM;
> +						goto out;
> +					}
> +					kfree(buf);
> +					buf_virtual = 1;
> +				}
> +			}
> +
> +			buf = buf2;
> +			buf2 = NULL;
> +		}
> +
> +		read_extent_buffer(eb, buf, (unsigned long)(di + 1),
> +				name_len + data_len);
> +
> +		len = sizeof(*di) + name_len + data_len;
> +		di = (struct btrfs_dir_item *)((char *)di + len);
> +		cur += len;
> +
> +		ret = iterate(num, buf, name_len, buf + name_len, data_len,
> +				type, ctx);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		num++;
> +	}
> +
> +out:
> +	btrfs_free_path(tmp_path);
> +	if (buf_virtual)
> +		vfree(buf);
> +	else
> +		kfree(buf);
> +	return ret;
> +}
> +
> +static int __copy_first_ref(int num, u64 dir, int index,
> +			    struct fs_path *p, void *ctx)
> +{
> +	int ret;
> +	struct fs_path *pt = ctx;
> +
> +	ret = fs_path_copy(pt, p);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* we want the first only */
> +	return 1;
> +}
> +
> +/*
> + * Retrieve the first path of an inode. If an inode has more than one
> + * ref/hardlink, this is ignored.
> + */
> +static int get_inode_path(struct send_ctx *sctx, struct btrfs_root *root,
> +			  u64 ino, struct fs_path *path)
> +{
> +	int ret;
> +	struct btrfs_key key, found_key;
> +	struct btrfs_path *p;
> +
> +	p = alloc_path_for_send();
> +	if (!p)
> +		return -ENOMEM;
> +
> +	fs_path_reset(path);
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_INODE_REF_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot_for_read(root, &key, p, 1, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret) {
> +		ret = 1;
> +		goto out;
> +	}
> +	btrfs_item_key_to_cpu(p->nodes[0], &found_key, p->slots[0]);
> +	if (found_key.objectid != ino ||
> +		found_key.type != BTRFS_INODE_REF_KEY) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	ret = iterate_inode_ref(sctx, root, p, &found_key, 1,
> +			__copy_first_ref, path);
> +	if (ret < 0)
> +		goto out;
> +	ret = 0;
> +
> +out:
> +	btrfs_free_path(p);
> +	return ret;
> +}
> +
> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
> new file mode 100644
> index 0000000..a4c23ee
> --- /dev/null
> +++ b/fs/btrfs/send.h
> @@ -0,0 +1,126 @@
> +/*
> + * Copyright (C) 2012 Alexander Block.  All rights reserved.
> + * Copyright (C) 2012 STRATO.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include "ctree.h"
> +
> +#define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
> +#define BTRFS_SEND_STREAM_VERSION 1
> +
> +#define BTRFS_SEND_BUF_SIZE (1024 * 64)
> +#define BTRFS_SEND_READ_SIZE (1024 * 48)
> +
> +enum btrfs_tlv_type {
> +	BTRFS_TLV_U8,
> +	BTRFS_TLV_U16,
> +	BTRFS_TLV_U32,
> +	BTRFS_TLV_U64,
> +	BTRFS_TLV_BINARY,
> +	BTRFS_TLV_STRING,
> +	BTRFS_TLV_UUID,
> +	BTRFS_TLV_TIMESPEC,
> +};
> +
> +struct btrfs_stream_header {
> +	char magic[sizeof(BTRFS_SEND_STREAM_MAGIC)];
> +	__le32 version;
> +} __attribute__ ((__packed__));
> +
> +struct btrfs_cmd_header {
> +	__le32 len;
> +	__le16 cmd;
> +	__le32 crc;
> +} __attribute__ ((__packed__));

Please add some comments to this struct, e.g. that len is the length
of the command data excluding the header, and that the crc is
calculated over the full command including the header, with the crc
field set to 0.
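For illustration, the convention described above could be sketched in
userspace like this (assumptions: the real stream uses crc32c, while a
plain bitwise crc32 stands in here, and all names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model of the on-stream command header. */
struct cmd_header {
	uint32_t len;	/* length of the command data, header excluded */
	uint16_t cmd;
	uint32_t crc;	/* crc over the full command, with crc set to 0 */
} __attribute__((__packed__));

/* Plain crc32 (reflected, poly 0xedb88320), stand-in for crc32c. */
static uint32_t crc32_buf(const uint8_t *buf, size_t n)
{
	uint32_t crc = 0xffffffffu;
	size_t i;
	int bit;

	for (i = 0; i < n; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xedb88320u & -(crc & 1u));
	}
	return ~crc;
}

/* Stamp the crc into a complete command (header followed by data). */
static void cmd_seal(uint8_t *cmd, size_t total_len)
{
	struct cmd_header *h = (struct cmd_header *)cmd;

	h->crc = 0;	/* the crc field itself is zeroed while hashing */
	h->crc = crc32_buf(cmd, total_len);
}

/* Verify by recomputing with the crc field zeroed again. */
static int cmd_verify(const uint8_t *cmd, size_t total_len)
{
	uint8_t tmp[256];
	uint32_t stored;

	if (total_len > sizeof(tmp))
		return 0;
	memcpy(tmp, cmd, total_len);
	memcpy(&stored, tmp + offsetof(struct cmd_header, crc),
	       sizeof(stored));
	memset(tmp + offsetof(struct cmd_header, crc), 0, sizeof(stored));
	return stored == crc32_buf(tmp, total_len);
}
```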

> +
> +struct btrfs_tlv_header {
> +	__le16 tlv_type;
> +	__le16 tlv_len;
> +} __attribute__ ((__packed__));
> +
> +/* commands */
> +enum btrfs_send_cmd {
> +	BTRFS_SEND_C_UNSPEC,
> +
> +	BTRFS_SEND_C_SUBVOL,
> +	BTRFS_SEND_C_SNAPSHOT,
> +
> +	BTRFS_SEND_C_MKFILE,
> +	BTRFS_SEND_C_MKDIR,
> +	BTRFS_SEND_C_MKNOD,
> +	BTRFS_SEND_C_MKFIFO,
> +	BTRFS_SEND_C_MKSOCK,
> +	BTRFS_SEND_C_SYMLINK,
> +
> +	BTRFS_SEND_C_RENAME,
> +	BTRFS_SEND_C_LINK,
> +	BTRFS_SEND_C_UNLINK,
> +	BTRFS_SEND_C_RMDIR,
> +
> +	BTRFS_SEND_C_SET_XATTR,
> +	BTRFS_SEND_C_REMOVE_XATTR,
> +
> +	BTRFS_SEND_C_WRITE,
> +	BTRFS_SEND_C_CLONE,
> +
> +	BTRFS_SEND_C_TRUNCATE,
> +	BTRFS_SEND_C_CHMOD,
> +	BTRFS_SEND_C_CHOWN,
> +	BTRFS_SEND_C_UTIMES,
> +
> +	BTRFS_SEND_C_END,
> +	__BTRFS_SEND_C_MAX,
> +};
> +#define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
> +
> +/* attributes in send stream */
> +enum {
> +	BTRFS_SEND_A_UNSPEC,
> +
> +	BTRFS_SEND_A_UUID,
> +	BTRFS_SEND_A_CTRANSID,
> +
> +	BTRFS_SEND_A_INO,
> +	BTRFS_SEND_A_SIZE,
> +	BTRFS_SEND_A_MODE,
> +	BTRFS_SEND_A_UID,
> +	BTRFS_SEND_A_GID,
> +	BTRFS_SEND_A_RDEV,
> +	BTRFS_SEND_A_CTIME,
> +	BTRFS_SEND_A_MTIME,
> +	BTRFS_SEND_A_ATIME,
> +	BTRFS_SEND_A_OTIME,
> +
> +	BTRFS_SEND_A_XATTR_NAME,
> +	BTRFS_SEND_A_XATTR_DATA,
> +
> +	BTRFS_SEND_A_PATH,
> +	BTRFS_SEND_A_PATH_TO,
> +	BTRFS_SEND_A_PATH_LINK,
> +
> +	BTRFS_SEND_A_FILE_OFFSET,
> +	BTRFS_SEND_A_DATA,
> +
> +	BTRFS_SEND_A_CLONE_UUID,
> +	BTRFS_SEND_A_CLONE_CTRANSID,
> +	BTRFS_SEND_A_CLONE_PATH,
> +	BTRFS_SEND_A_CLONE_OFFSET,
> +	BTRFS_SEND_A_CLONE_LEN,
> +
> +	__BTRFS_SEND_A_MAX,
> +};
> +#define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1)
  2012-07-04 13:38 ` [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1) Alexander Block
  2012-07-18  6:59   ` Arne Jansen
@ 2012-07-21 10:53   ` Arne Jansen
  1 sibling, 0 replies; 43+ messages in thread
From: Arne Jansen @ 2012-07-21 10:53 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On 07/04/2012 03:38 PM, Alexander Block wrote:
> This patch introduces the BTRFS_IOC_SEND ioctl that is
> required for send. It allows btrfs-progs to implement
> full and incremental sends. Patches for btrfs-progs will
> follow.
> 
> I had to split the patch as it got larger then 100k which is
> the limit for the mailing list. The first part only contains
> the send.h header and the helper functions for TLV handling
> and long path name handling and some other helpers. The second
> part contains the actual send logic from send.c
> 
> Signed-off-by: Alexander Block <ablock84@googlemail.com>
> ---
[snip]
> +
> +struct name_cache_entry {
> +	struct list_head list;
> +	struct list_head use_list;

unused.

> +	u64 ino;
> +	u64 gen;
> +	u64 parent_ino;
> +	u64 parent_gen;
> +	int ret;
> +	int need_later_update;
> +	int name_len;
> +	char name[];
> +};
> +


* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
  2012-07-10 15:26   ` Alex Lyakas
@ 2012-07-23 11:16   ` Arne Jansen
  2012-07-23 15:28     ` Alex Lyakas
  2012-07-28 13:49     ` Alexander Block
  2012-07-23 15:17   ` Alex Lyakas
  2 siblings, 2 replies; 43+ messages in thread
From: Arne Jansen @ 2012-07-23 11:16 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

This is a first review run; I ask for more comments in several places.
Maybe these comments can help us dive deeper into a functional review
in a second run.
I'd really appreciate it if you could write a few pages about the
concepts of how you decide what to send and when.
It seems there's still a lot of headroom for performance optimizations,
CPU- and seek-wise.
All in all, I really like this work.

On 04.07.2012 15:38, Alexander Block wrote:
> This is the second part of the splitted BTRFS_IOC_SEND patch which
> contains the actual send logic.
> 
> Signed-off-by: Alexander Block <ablock84@googlemail.com>
> ---
>  fs/btrfs/ioctl.c |    3 +
>  fs/btrfs/send.c  | 3246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/send.h  |    4 +
>  3 files changed, 3253 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 8d258cb..9173867 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -54,6 +54,7 @@
>  #include "inode-map.h"
>  #include "backref.h"
>  #include "rcu-string.h"
> +#include "send.h"
>  
>  /* Mask out flags that are inappropriate for the given type of inode. */
>  static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
> @@ -3567,6 +3568,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_balance_progress(root, argp);
>  	case BTRFS_IOC_SET_RECEIVED_SUBVOL:
>  		return btrfs_ioctl_set_received_subvol(file, argp);
> +	case BTRFS_IOC_SEND:
> +		return btrfs_ioctl_send(file, argp);
>  	case BTRFS_IOC_GET_DEV_STATS:
>  		return btrfs_ioctl_get_dev_stats(root, argp, 0);
>  	case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 47a2557..4d3fcfc 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1007,3 +1007,3249 @@ out:
>  	return ret;
>  }
>  
> +struct backref_ctx {
> +	struct send_ctx *sctx;
> +
> +	/* number of total found references */
> +	u64 found;
> +
> +	/*
> +	 * used for clones found in send_root. clones found behind cur_objectid
> +	 * and cur_offset are not considered as allowed clones.
> +	 */
> +	u64 cur_objectid;
> +	u64 cur_offset;
> +
> +	/* may be truncated in case it's the last extent in a file */
> +	u64 extent_len;
> +
> +	/* Just to check for bugs in backref resolving */
> +	int found_in_send_root;
> +};
> +
> +static int __clone_root_cmp_bsearch(const void *key, const void *elt)
> +{
> +	u64 root = (u64)key;
> +	struct clone_root *cr = (struct clone_root *)elt;
> +
> +	if (root < cr->root->objectid)
> +		return -1;
> +	if (root > cr->root->objectid)
> +		return 1;
> +	return 0;
> +}
> +
> +static int __clone_root_cmp_sort(const void *e1, const void *e2)
> +{
> +	struct clone_root *cr1 = (struct clone_root *)e1;
> +	struct clone_root *cr2 = (struct clone_root *)e2;
> +
> +	if (cr1->root->objectid < cr2->root->objectid)
> +		return -1;
> +	if (cr1->root->objectid > cr2->root->objectid)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Called for every backref that is found for the current extent.

Comment: results are collected in sctx->clone_roots->ino/offset/found_refs

> + */
> +static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_)
> +{
> +	struct backref_ctx *bctx = ctx_;
> +	struct clone_root *found;
> +	int ret;
> +	u64 i_size;
> +
> +	/* First check if the root is in the list of accepted clone sources */
> +	found = bsearch((void *)root, bctx->sctx->clone_roots,
> +			bctx->sctx->clone_roots_cnt,
> +			sizeof(struct clone_root),
> +			__clone_root_cmp_bsearch);
> +	if (!found)
> +		return 0;
> +
> +	if (found->root == bctx->sctx->send_root &&
> +	    ino == bctx->cur_objectid &&
> +	    offset == bctx->cur_offset) {
> +		bctx->found_in_send_root = 1;

found_in_send_root_and_cur_ino_offset?

> +	}
> +
> +	/*
> +	 * There are inodes that have extents that lie behind it's i_size. Don't
                                                              its
> +	 * accept clones from these extents.
> +	 */
> +	ret = get_inode_info(found->root, ino, &i_size, NULL, NULL, NULL, NULL);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (offset + bctx->extent_len > i_size)
> +		return 0;
> +
> +	/*
> +	 * Make sure we don't consider clones from send_root that are
> +	 * behind the current inode/offset.
> +	 */
> +	if (found->root == bctx->sctx->send_root) {
> +		/*
> +		 * TODO for the moment we don't accept clones from the inode
> +		 * that is currently send. We may change this when
> +		 * BTRFS_IOC_CLONE_RANGE supports cloning from and to the same
> +		 * file.
> +		 */
> +		if (ino >= bctx->cur_objectid)
> +			return 0;
> +		/*if (ino > ctx->cur_objectid)
> +			return 0;
> +		if (offset + ctx->extent_len > ctx->cur_offset)
> +			return 0;*/

#if 0 ... #else ... #endif

> +
> +		bctx->found++;
> +		found->found_refs++;
> +		found->ino = ino;
> +		found->offset = offset;

only the last ino is kept?

> +		return 0;
> +	}
> +
> +	bctx->found++;
> +	found->found_refs++;
> +	if (ino < found->ino) {
> +		found->ino = ino;
> +		found->offset = offset;

whereas here only the lowest ino is kept. Why?

> +	} else if (found->ino == ino) {
> +		/*
> +		 * same extent found more then once in the same file.
> +		 */
> +		if (found->offset > offset + bctx->extent_len)
> +			found->offset = offset;

This is unclear to me. It seems to mean something like
'find the lowest offset', but not exactly. Some explanation
would be good.

> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * path must point to the extent item when called.
> + */

What is the purpose of this function? I probably will figure it out
when reading on, but a comment would be nice here.

> +static int find_extent_clone(struct send_ctx *sctx,
> +			     struct btrfs_path *path,
> +			     u64 ino, u64 data_offset,
> +			     u64 ino_size,
> +			     struct clone_root **found)
> +{
> +	int ret;
> +	int extent_type;
> +	u64 logical;
> +	u64 num_bytes;
> +	u64 extent_item_pos;
> +	struct btrfs_file_extent_item *fi;
> +	struct extent_buffer *eb = path->nodes[0];
> +	struct backref_ctx backref_ctx;

Currently it's still small enough to keep on the stack; a
comment in struct backref_ctx noting that it is kept on the
stack would be nice.

> +	struct clone_root *cur_clone_root;
> +	struct btrfs_key found_key;
> +	struct btrfs_path *tmp_path;
> +	u32 i;
> +
> +	tmp_path = alloc_path_for_send();
> +	if (!tmp_path)
> +		return -ENOMEM;
> +
> +	if (data_offset >= ino_size) {
> +		/*
> +		 * There may be extents that lie behind the file's size.
> +		 * I at least had this in combination with snapshotting while
> +		 * writing large files.
> +		 */
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	fi = btrfs_item_ptr(eb, path->slots[0],
> +			struct btrfs_file_extent_item);
> +	extent_type = btrfs_file_extent_type(eb, fi);
> +	if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	num_bytes = btrfs_file_extent_num_bytes(eb, fi);
> +	logical = btrfs_file_extent_disk_bytenr(eb, fi);
> +	if (logical == 0) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	logical += btrfs_file_extent_offset(eb, fi);
> +
> +	ret = extent_from_logical(sctx->send_root->fs_info,
> +			logical, tmp_path, &found_key);
> +	btrfs_release_path(tmp_path);
> +
> +	if (ret < 0)
> +		goto out;
> +	if (ret & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Setup the clone roots.
> +	 */
> +	for (i = 0; i < sctx->clone_roots_cnt; i++) {
> +		cur_clone_root = sctx->clone_roots + i;
> +		cur_clone_root->ino = (u64)-1;
> +		cur_clone_root->offset = 0;
> +		cur_clone_root->found_refs = 0;
> +	}
> +
> +	backref_ctx.sctx = sctx;
> +	backref_ctx.found = 0;
> +	backref_ctx.cur_objectid = ino;
> +	backref_ctx.cur_offset = data_offset;
> +	backref_ctx.found_in_send_root = 0;
> +	backref_ctx.extent_len = num_bytes;
> +
> +	/*
> +	 * The last extent of a file may be too large due to page alignment.
> +	 * We need to adjust extent_len in this case so that the checks in
> +	 * __iterate_backrefs work.
> +	 */
> +	if (data_offset + num_bytes >= ino_size)
> +		backref_ctx.extent_len = ino_size - data_offset;
> +
> +	/*
> +	 * Now collect all backrefs.
> +	 */
> +	extent_item_pos = logical - found_key.objectid;
> +	ret = iterate_extent_inodes(sctx->send_root->fs_info,
> +					found_key.objectid, extent_item_pos, 1,
> +					__iterate_backrefs, &backref_ctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (!backref_ctx.found_in_send_root) {
> +		/* found a bug in backref code? */
> +		ret = -EIO;
> +		printk(KERN_ERR "btrfs: ERROR did not find backref in "
> +				"send_root. inode=%llu, offset=%llu, "
> +				"logical=%llu\n",
> +				ino, data_offset, logical);
> +		goto out;
> +	}
> +
> +verbose_printk(KERN_DEBUG "btrfs: find_extent_clone: data_offset=%llu, "
> +		"ino=%llu, "
> +		"num_bytes=%llu, logical=%llu\n",
> +		data_offset, ino, num_bytes, logical);
> +
> +	if (!backref_ctx.found)
> +		verbose_printk("btrfs:    no clones found\n");
> +
> +	cur_clone_root = NULL;
> +	for (i = 0; i < sctx->clone_roots_cnt; i++) {
> +		if (sctx->clone_roots[i].found_refs) {
> +			if (!cur_clone_root)
> +				cur_clone_root = sctx->clone_roots + i;
> +			else if (sctx->clone_roots[i].root == sctx->send_root)
> +				/* prefer clones from send_root over others */
> +				cur_clone_root = sctx->clone_roots + i;
> +			break;

If you break after the first found ref, you might miss the send_root.
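A sketch of a selection loop without this problem (simplified,
hypothetical types; only the scan order matters here): scan all
entries, and break early only once send_root itself has been taken.

```c
#include <stddef.h>

/* Hypothetical simplified stand-in for the kernel's clone_root. */
struct clone_root {
	int root_id;
	int found_refs;
};

/*
 * Pick a clone source: any root with references is acceptable, but
 * send_root is preferred. The loop must visit every entry; it may
 * only stop early once send_root itself has been found.
 */
static struct clone_root *pick_clone_root(struct clone_root *roots,
					  size_t cnt, int send_root_id)
{
	struct clone_root *best = NULL;
	size_t i;

	for (i = 0; i < cnt; i++) {
		if (!roots[i].found_refs)
			continue;
		if (!best)
			best = &roots[i];
		if (roots[i].root_id == send_root_id) {
			best = &roots[i];
			break;	/* cannot do better than send_root */
		}
	}
	return best;
}
```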

> +		}
> +
> +	}
> +
> +	if (cur_clone_root) {
> +		*found = cur_clone_root;
> +		ret = 0;
> +	} else {
> +		ret = -ENOENT;
> +	}
> +
> +out:
> +	btrfs_free_path(tmp_path);
> +	return ret;
> +}
> +
> +static int read_symlink(struct send_ctx *sctx,
> +			struct btrfs_root *root,
> +			u64 ino,
> +			struct fs_path *dest)
> +{
> +	int ret;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_file_extent_item *ei;
> +	u8 type;
> +	u8 compression;
> +	unsigned long off;
> +	int len;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = 0;
> +	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +	BUG_ON(ret);
> +
> +	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +			struct btrfs_file_extent_item);
> +	type = btrfs_file_extent_type(path->nodes[0], ei);
> +	compression = btrfs_file_extent_compression(path->nodes[0], ei);
> +	BUG_ON(type != BTRFS_FILE_EXTENT_INLINE);
> +	BUG_ON(compression);
> +
> +	off = btrfs_file_extent_inline_start(ei);
> +	len = btrfs_file_extent_inline_len(path->nodes[0], ei);
> +
> +	ret = fs_path_add_from_extent_buffer(dest, path->nodes[0], off, len);
> +	if (ret < 0)
> +		goto out;

superfluous

> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * Helper function to generate a file name that is unique in the root of
> + * send_root and parent_root. This is used to generate names for orphan inodes.
> + */
> +static int gen_unique_name(struct send_ctx *sctx,
> +			   u64 ino, u64 gen,
> +			   struct fs_path *dest)
> +{
> +	int ret = 0;
> +	struct btrfs_path *path;
> +	struct btrfs_dir_item *di;
> +	char tmp[64];
> +	int len;
> +	u64 idx = 0;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	while (1) {
> +		len = snprintf(tmp, sizeof(tmp) - 1, "o%llu-%llu-%llu",
> +				ino, gen, idx);

Wouldn't it be easier to just use a uuid? That would save you a lot
of code, and especially the need to verify that the name is really
unique, which saves seeks.
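For comparison, the probing scheme the loop implements looks like this
in userspace (the demo_taken predicate is a hypothetical stand-in for
the two btrfs_lookup_dir_item calls against send_root and parent_root):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate standing in for the dir-item lookups. */
typedef int (*exists_fn)(const char *name);

/*
 * Probe "o<ino>-<gen>-<idx>" names until one is free, as
 * gen_unique_name does; returns 0 on success, -1 if the candidate
 * would not fit into the buffer.
 */
static int gen_unique_name(uint64_t ino, uint64_t gen, exists_fn exists,
			   char *out, size_t out_len)
{
	char tmp[64];
	uint64_t idx = 0;
	int len;

	for (;;) {
		len = snprintf(tmp, sizeof(tmp), "o%llu-%llu-%llu",
			       (unsigned long long)ino,
			       (unsigned long long)gen,
			       (unsigned long long)idx);
		if (len < 0 || (size_t)len >= sizeof(tmp))
			return -1;	/* should really not happen */
		if (!exists(tmp))
			break;		/* name is free in both roots */
		idx++;			/* collision: try the next index */
	}
	if (strlen(tmp) + 1 > out_len)
		return -1;
	strcpy(out, tmp);
	return 0;
}

/* Demo predicate: pretend the first two candidates are taken. */
static int demo_taken(const char *name)
{
	return strcmp(name, "o1-2-0") == 0 || strcmp(name, "o1-2-1") == 0;
}
```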

> +		if (len >= sizeof(tmp)) {
> +			/* should really not happen */
> +			ret = -EOVERFLOW;
> +			goto out;
> +		}
> +
> +		di = btrfs_lookup_dir_item(NULL, sctx->send_root,
> +				path, BTRFS_FIRST_FREE_OBJECTID,
> +				tmp, strlen(tmp), 0);
> +		btrfs_release_path(path);
> +		if (IS_ERR(di)) {
> +			ret = PTR_ERR(di);
> +			goto out;
> +		}
> +		if (di) {
> +			/* not unique, try again */
> +			idx++;
> +			continue;
> +		}
> +
> +		if (!sctx->parent_root) {
> +			/* unique */
> +			ret = 0;
> +			break;
> +		}
> +
> +		di = btrfs_lookup_dir_item(NULL, sctx->parent_root,
> +				path, BTRFS_FIRST_FREE_OBJECTID,
> +				tmp, strlen(tmp), 0);
> +		btrfs_release_path(path);
> +		if (IS_ERR(di)) {
> +			ret = PTR_ERR(di);
> +			goto out;
> +		}
> +		if (di) {
> +			/* not unique, try again */
> +			idx++;
> +			continue;
> +		}
> +		/* unique */
> +		break;
> +	}
> +
> +	ret = fs_path_add(dest, tmp, strlen(tmp));
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +enum inode_state {
> +	inode_state_no_change,
> +	inode_state_will_create,
> +	inode_state_did_create,
> +	inode_state_will_delete,
> +	inode_state_did_delete,
> +};
> +
> +static int get_cur_inode_state(struct send_ctx *sctx, u64 ino, u64 gen)

don't you want to return an enum inode_state instead of an int?

> +{
> +	int ret;
> +	int left_ret;
> +	int right_ret;
> +	u64 left_gen;
> +	u64 right_gen;
> +
> +	ret = get_inode_info(sctx->send_root, ino, NULL, &left_gen, NULL, NULL,
> +			NULL);
> +	if (ret < 0 && ret != -ENOENT)
> +		goto out;
> +	left_ret = ret;
> +
> +	if (!sctx->parent_root) {
> +		right_ret = -ENOENT;
> +	} else {
> +		ret = get_inode_info(sctx->parent_root, ino, NULL, &right_gen,
> +				NULL, NULL, NULL);
> +		if (ret < 0 && ret != -ENOENT)
> +			goto out;
> +		right_ret = ret;
> +	}
> +
> +	if (!left_ret && !right_ret) {
> +		if (left_gen == gen && right_gen == gen)

Please also use {} here

> +			ret = inode_state_no_change;
> +		else if (left_gen == gen) {
> +			if (ino < sctx->send_progress)
> +				ret = inode_state_did_create;
> +			else
> +				ret = inode_state_will_create;
> +		} else if (right_gen == gen) {
> +			if (ino < sctx->send_progress)
> +				ret = inode_state_did_delete;
> +			else
> +				ret = inode_state_will_delete;
> +		} else  {
> +			ret = -ENOENT;
> +		}
> +	} else if (!left_ret) {
> +		if (left_gen == gen) {
> +			if (ino < sctx->send_progress)
> +				ret = inode_state_did_create;
> +			else
> +				ret = inode_state_will_create;
> +		} else {
> +			ret = -ENOENT;
> +		}
> +	} else if (!right_ret) {
> +		if (right_gen == gen) {
> +			if (ino < sctx->send_progress)
> +				ret = inode_state_did_delete;
> +			else
> +				ret = inode_state_will_delete;
> +		} else {
> +			ret = -ENOENT;
> +		}
> +	} else {
> +		ret = -ENOENT;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
> +{
> +	int ret;
> +
> +	ret = get_cur_inode_state(sctx, ino, gen);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (ret == inode_state_no_change ||
> +	    ret == inode_state_did_create ||
> +	    ret == inode_state_will_delete)
> +		ret = 1;
> +	else
> +		ret = 0;
> +
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Helper function to lookup a dir item in a dir.
> + */
> +static int lookup_dir_item_inode(struct btrfs_root *root,
> +				 u64 dir, const char *name, int name_len,
> +				 u64 *found_inode,
> +				 u8 *found_type)
> +{
> +	int ret = 0;
> +	struct btrfs_dir_item *di;
> +	struct btrfs_key key;
> +	struct btrfs_path *path;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	di = btrfs_lookup_dir_item(NULL, root, path,
> +			dir, name, name_len, 0);
> +	if (!di) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	if (IS_ERR(di)) {
> +		ret = PTR_ERR(di);
> +		goto out;
> +	}
> +	btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key);
> +	*found_inode = key.objectid;
> +	*found_type = btrfs_dir_type(path->nodes[0], di);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int get_first_ref(struct send_ctx *sctx,

The name does not reflect well what the function does.
It's more like get_first_parent_dir or get_first_inode_ref

> +			 struct btrfs_root *root, u64 ino,
> +			 u64 *dir, u64 *dir_gen, struct fs_path *name)
> +{
> +	int ret;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct btrfs_path *path;
> +	struct btrfs_inode_ref *iref;
> +	int len;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_INODE_REF_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (!ret)
> +		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
> +				path->slots[0]);
> +	if (ret || found_key.objectid != key.objectid ||
> +	    found_key.type != key.type) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	iref = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +			struct btrfs_inode_ref);
> +	len = btrfs_inode_ref_name_len(path->nodes[0], iref);
> +	ret = fs_path_add_from_extent_buffer(name, path->nodes[0],
> +			(unsigned long)(iref + 1), len);
> +	if (ret < 0)
> +		goto out;
> +	btrfs_release_path(path);
> +
> +	ret = get_inode_info(root, found_key.offset, NULL, dir_gen, NULL, NULL,
> +			NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	*dir = found_key.offset;
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int is_first_ref(struct send_ctx *sctx,
> +			struct btrfs_root *root,
> +			u64 ino, u64 dir,
> +			const char *name, int name_len)
> +{
> +	int ret;
> +	struct fs_path *tmp_name;
> +	u64 tmp_dir;
> +	u64 tmp_dir_gen;
> +
> +	tmp_name = fs_path_alloc(sctx);
> +	if (!tmp_name)
> +		return -ENOMEM;
> +
> +	ret = get_first_ref(sctx, root, ino, &tmp_dir, &tmp_dir_gen, tmp_name);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (name_len != fs_path_len(tmp_name)) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	ret = memcmp(tmp_name->start, name, name_len);

or just ret = !memcmp...?

> +	if (ret)
> +		ret = 0;
> +	else
> +		ret = 1;
> +
> +out:
> +	fs_path_free(sctx, tmp_name);
> +	return ret;
> +}
> +
> +static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
> +			      const char *name, int name_len,
> +			      u64 *who_ino, u64 *who_gen)
> +{
> +	int ret = 0;
> +	u64 other_inode = 0;
> +	u8 other_type = 0;
> +
> +	if (!sctx->parent_root)
> +		goto out;
> +
> +	ret = is_inode_existent(sctx, dir, dir_gen);
> +	if (ret <= 0)
> +		goto out;
> +
> +	ret = lookup_dir_item_inode(sctx->parent_root, dir, name, name_len,
> +			&other_inode, &other_type);
> +	if (ret < 0 && ret != -ENOENT)
> +		goto out;
> +	if (ret) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	if (other_inode > sctx->send_progress) {

I haven't really grasped what this function does (a comment would be
nice), but I have a feeling that renames might break things when the
parent is not a direct ancestor. Maybe it gets clearer when I read
on ;)

> +		ret = get_inode_info(sctx->parent_root, other_inode, NULL,
> +				who_gen, NULL, NULL, NULL);
> +		if (ret < 0)
> +			goto out;
> +
> +		ret = 1;
> +		*who_ino = other_inode;
> +	} else {
> +		ret = 0;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int did_overwrite_ref(struct send_ctx *sctx,
> +			    u64 dir, u64 dir_gen,
> +			    u64 ino, u64 ino_gen,
> +			    const char *name, int name_len)
> +{
> +	int ret = 0;
> +	u64 gen;
> +	u64 ow_inode;
> +	u8 other_type;
> +
> +	if (!sctx->parent_root)
> +		goto out;
> +
> +	ret = is_inode_existent(sctx, dir, dir_gen);
> +	if (ret <= 0)
> +		goto out;
> +
> +	/* check if the ref was overwritten by another ref */
> +	ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
> +			&ow_inode, &other_type);
> +	if (ret < 0 && ret != -ENOENT)
> +		goto out;
> +	if (ret) {
> +		/* was never and will never be overwritten */
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	ret = get_inode_info(sctx->send_root, ow_inode, NULL, &gen, NULL, NULL,
> +			NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (ow_inode == ino && gen == ino_gen) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* we know that it is or will be overwritten. check this now */
> +	if (ow_inode < sctx->send_progress)
> +		ret = 1;
> +	else
> +		ret = 0;
> +
> +out:
> +	return ret;
> +}
> +
> +static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen)
> +{
> +	int ret = 0;
> +	struct fs_path *name = NULL;
> +	u64 dir;
> +	u64 dir_gen;
> +
> +	if (!sctx->parent_root)
> +		goto out;
> +
> +	name = fs_path_alloc(sctx);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	ret = get_first_ref(sctx, sctx->parent_root, ino, &dir, &dir_gen, name);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = did_overwrite_ref(sctx, dir, dir_gen, ino, gen,
> +			name->start, fs_path_len(name));

> +	if (ret < 0)
> +		goto out;

superfluous

> +
> +out:
> +	fs_path_free(sctx, name);
> +	return ret;
> +}
> +
> +static int name_cache_insert(struct send_ctx *sctx,
> +			     struct name_cache_entry *nce)
> +{
> +	int ret = 0;
> +	struct name_cache_entry **ncea;
> +
> +	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);

Attention: radix trees take an unsigned long as index, and ino
is a u64. You're in trouble on 32-bit.
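The collision is easy to demonstrate: two inodes 2^32 apart map to the
same index once the u64 is truncated (uint32_t models the 32-bit
unsigned long here; the function name is made up for illustration):

```c
#include <stdint.h>

/*
 * Model of what happens when a u64 ino is used as a radix-tree index
 * on a 32-bit machine: the index silently truncates to 32 bits, so
 * distinct inodes can land on the same radix-tree slot.
 */
static uint32_t radix_index_32bit(uint64_t ino)
{
	return (uint32_t)ino;	/* high 32 bits are dropped */
}
```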

> +	if (ncea) {
> +		if (!ncea[0])
> +			ncea[0] = nce;
> +		else if (!ncea[1])
> +			ncea[1] = nce;
> +		else
> +			BUG();
> +	} else {
> +		ncea = kmalloc(sizeof(void *) * 2, GFP_NOFS);
> +		if (!ncea)
> +			return -ENOMEM;
> +
> +		ncea[0] = nce;
> +		ncea[1] = NULL;
> +		ret = radix_tree_insert(&sctx->name_cache, nce->ino, ncea);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	list_add_tail(&nce->list, &sctx->name_cache_list);
> +	sctx->name_cache_size++;
> +
> +	return ret;
> +}
> +
> +static void name_cache_delete(struct send_ctx *sctx,
> +			      struct name_cache_entry *nce)
> +{
> +	struct name_cache_entry **ncea;
> +
> +	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);
> +	BUG_ON(!ncea);
> +
> +	if (ncea[0] == nce)
> +		ncea[0] = NULL;
> +	else if (ncea[1] == nce)
> +		ncea[1] = NULL;
> +	else
> +		BUG();
> +
> +	if (!ncea[0] && !ncea[1]) {
> +		radix_tree_delete(&sctx->name_cache, nce->ino);
> +		kfree(ncea);
> +	}
> +
> +	list_del(&nce->list);
> +
> +	sctx->name_cache_size--;
> +}
> +
> +static struct name_cache_entry *name_cache_search(struct send_ctx *sctx,
> +						    u64 ino, u64 gen)
> +{
> +	struct name_cache_entry **ncea;
> +
> +	ncea = radix_tree_lookup(&sctx->name_cache, ino);
> +	if (!ncea)
> +		return NULL;
> +
> +	if (ncea[0] && ncea[0]->gen == gen)
> +		return ncea[0];
> +	else if (ncea[1] && ncea[1]->gen == gen)
> +		return ncea[1];
> +	return NULL;
> +}
> +
> +static void name_cache_used(struct send_ctx *sctx, struct name_cache_entry *nce)
> +{
> +	list_del(&nce->list);
> +	list_add_tail(&nce->list, &sctx->name_cache_list);
> +}
> +
> +static void name_cache_clean_unused(struct send_ctx *sctx)
> +{
> +	struct name_cache_entry *nce;
> +
> +	if (sctx->name_cache_size < SEND_CTX_NAME_CACHE_CLEAN_SIZE)
> +		return;

superfluous, the while condition below is enough.

> +
> +	while (sctx->name_cache_size > SEND_CTX_MAX_NAME_CACHE_SIZE) {
> +		nce = list_entry(sctx->name_cache_list.next,
> +				struct name_cache_entry, list);
> +		name_cache_delete(sctx, nce);
> +		kfree(nce);
> +	}
> +}
> +
> +static void name_cache_free(struct send_ctx *sctx)
> +{
> +	struct name_cache_entry *nce;
> +	struct name_cache_entry *tmp;
> +
> +	list_for_each_entry_safe(nce, tmp, &sctx->name_cache_list, list) {

It's easier to just always delete the head until the list is empty;
that saves you the tmp variable.
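The suggested drain pattern, with a minimal userspace doubly-linked
list standing in for the kernel's list_head API (all names here are
illustrative, not the kernel's):

```c
#include <stddef.h>

/* Minimal circular doubly-linked list, mimicking the kernel list API. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

static void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

static int list_empty(const struct list_node *h)
{
	return h->next == h;
}

/*
 * Drain by always deleting the head: no _safe iterator or tmp
 * variable is needed, because each element is unlinked before the
 * next one is looked at.
 */
static int drain_count(struct list_node *head)
{
	int n = 0;

	while (!list_empty(head)) {
		list_del(head->next);
		n++;
	}
	return n;
}
```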

> +		name_cache_delete(sctx, nce);
> +	}
> +}
> +
> +static int __get_cur_name_and_parent(struct send_ctx *sctx,
> +				     u64 ino, u64 gen,
> +				     u64 *parent_ino,
> +				     u64 *parent_gen,
> +				     struct fs_path *dest)
> +{
> +	int ret;
> +	int nce_ret;
> +	struct btrfs_path *path = NULL;
> +	struct name_cache_entry *nce = NULL;
> +
> +	nce = name_cache_search(sctx, ino, gen);
> +	if (nce) {
> +		if (ino < sctx->send_progress && nce->need_later_update) {
> +			name_cache_delete(sctx, nce);
> +			kfree(nce);
> +			nce = NULL;
> +		} else {
> +			name_cache_used(sctx, nce);
> +			*parent_ino = nce->parent_ino;
> +			*parent_gen = nce->parent_gen;
> +			ret = fs_path_add(dest, nce->name, nce->name_len);
> +			if (ret < 0)
> +				goto out;
> +			ret = nce->ret;
> +			goto out;
> +		}
> +	}
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = is_inode_existent(sctx, ino, gen);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (!ret) {
> +		ret = gen_unique_name(sctx, ino, gen, dest);
> +		if (ret < 0)
> +			goto out;
> +		ret = 1;
> +		goto out_cache;
> +	}
> +
> +	if (ino < sctx->send_progress)
> +		ret = get_first_ref(sctx, sctx->send_root, ino,
> +				parent_ino, parent_gen, dest);
> +	else
> +		ret = get_first_ref(sctx, sctx->parent_root, ino,
> +				parent_ino, parent_gen, dest);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = did_overwrite_ref(sctx, *parent_ino, *parent_gen, ino, gen,
> +			dest->start, dest->end - dest->start);
> +	if (ret < 0)
> +		goto out;
> +	if (ret) {
> +		fs_path_reset(dest);
> +		ret = gen_unique_name(sctx, ino, gen, dest);
> +		if (ret < 0)
> +			goto out;
> +		ret = 1;
> +	}
> +
> +out_cache:
> +	nce = kmalloc(sizeof(*nce) + fs_path_len(dest) + 1, GFP_NOFS);
> +	if (!nce) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	nce->ino = ino;
> +	nce->gen = gen;
> +	nce->parent_ino = *parent_ino;
> +	nce->parent_gen = *parent_gen;
> +	nce->name_len = fs_path_len(dest);
> +	nce->ret = ret;

This is a bit too magic for me. ret == 1 iff it's a unique_name?

> +	strcpy(nce->name, dest->start);
> +	memset(&nce->use_list, 0, sizeof(nce->use_list));

use_list is unused; in any case, memset is a strange way to initialize
a list_head: there's the INIT_LIST_HEAD macro for that.

> +
> +	if (ino < sctx->send_progress)
> +		nce->need_later_update = 0;
> +	else
> +		nce->need_later_update = 1;
> +
> +	nce_ret = name_cache_insert(sctx, nce);
> +	if (nce_ret < 0)
> +		ret = nce_ret;
> +	name_cache_clean_unused(sctx);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * Magic happens here. This function returns the first ref to an inode as it
> + * would look like while receiving the stream at this point in time.
> + * We walk the path up to the root. For every inode in between, we check if it
> + * was already processed/sent. If yes, we continue with the parent as found
> + * in send_root. If not, we continue with the parent as found in parent_root.
> + * If we encounter an inode that was deleted at this point in time, we use the
> + * inodes "orphan" name instead of the real name and stop. Same with new inodes
> + * that were not created yet and overwritten inodes/refs.
> + *
> + * When do we have have orphan inodes:
> + * 1. When an inode is freshly created and thus no valid refs are available yet
> + * 2. When a directory lost all it's refs (deleted) but still has dir items
> + *    inside which were not processed yet (pending for move/delete). If anyone
> + *    tried to get the path to the dir items, it would get a path inside that
> + *    orphan directory.
> + * 3. When an inode is moved around or gets new links, it may overwrite the ref
> + *    of an unprocessed inode. If in that case the first ref would be
> + *    overwritten, the overwritten inode gets "orphanized". Later when we
> + *    process this overwritten inode, it is restored at a new place by moving
> + *    the orphan inode.
> + *
> + * sctx->send_progress tells this function at which point in time receiving
> + * would be.
> + */

Thanks for the comment :)

> +static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
> +			struct fs_path *dest)
> +{
> +	int ret = 0;
> +	struct fs_path *name = NULL;
> +	u64 parent_inode = 0;
> +	u64 parent_gen = 0;
> +	int stop = 0;
> +
> +	name = fs_path_alloc(sctx);
> +	if (!name) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	dest->reversed = 1;
> +	fs_path_reset(dest);
> +
> +	while (!stop && ino != BTRFS_FIRST_FREE_OBJECTID) {
> +		fs_path_reset(name);
> +
> +		ret = __get_cur_name_and_parent(sctx, ino, gen,
> +				&parent_inode, &parent_gen, name);
> +		if (ret < 0)
> +			goto out;
> +		if (ret)
> +			stop = 1;
> +
> +		ret = fs_path_add_path(dest, name);
> +		if (ret < 0)
> +			goto out;
> +
> +		ino = parent_inode;
> +		gen = parent_gen;
> +	}
> +
> +out:
> +	fs_path_free(sctx, name);
> +	if (!ret)
> +		fs_path_unreverse(dest);
> +	return ret;
> +}
> +
> +/*
> + * Called for regular files when sending extents data. Opens a struct file
> + * to read from the file.
> + */
> +static int open_cur_inode_file(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +	struct btrfs_key key;
> +	struct vfsmount *mnt;
> +	struct inode *inode;
> +	struct dentry *dentry;
> +	struct file *filp;
> +	int new = 0;
> +
> +	if (sctx->cur_inode_filp)
> +		goto out;
> +
> +	key.objectid = sctx->cur_ino;
> +	key.type = BTRFS_INODE_ITEM_KEY;
> +	key.offset = 0;
> +
> +	inode = btrfs_iget(sctx->send_root->fs_info->sb, &key, sctx->send_root,
> +			&new);
> +	if (IS_ERR(inode)) {
> +		ret = PTR_ERR(inode);
> +		goto out;
> +	}
> +
> +	dentry = d_obtain_alias(inode);
> +	inode = NULL;
> +	if (IS_ERR(dentry)) {
> +		ret = PTR_ERR(dentry);
> +		goto out;
> +	}
> +
> +	mnt = mntget(sctx->mnt);
> +	filp = dentry_open(dentry, mnt, O_RDONLY | O_LARGEFILE, current_cred());
> +	dentry = NULL;
> +	mnt = NULL;

It would be good if this part could be reviewed by someone with
deep vfs knowledge. Maybe you could split those parts out into a
separate patch and send it to the appropriate people for review.

> +	if (IS_ERR(filp)) {
> +		ret = PTR_ERR(filp);
> +		goto out;
> +	}
> +	sctx->cur_inode_filp = filp;
> +
> +out:
> +	/*
> +	 * no xxxput required here as every vfs op
> +	 * does it by itself on failure
> +	 */
> +	return ret;
> +}
> +
> +/*
> + * Closes the struct file that was created in open_cur_inode_file
> + */
> +static int close_cur_inode_file(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +
> +	if (!sctx->cur_inode_filp)
> +		goto out;
> +
> +	ret = filp_close(sctx->cur_inode_filp, NULL);
> +	sctx->cur_inode_filp = NULL;
> +
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Sends a BTRFS_SEND_C_SUBVOL command/item to userspace
> + */
> +static int send_subvol_begin(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_root *send_root = sctx->send_root;
> +	struct btrfs_root *parent_root = sctx->parent_root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_root_ref *ref;
> +	struct extent_buffer *leaf;
> +	char *name = NULL;
> +	int namelen;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	name = kmalloc(BTRFS_PATH_NAME_MAX, GFP_NOFS);
> +	if (!name) {
> +		btrfs_free_path(path);
> +		return -ENOMEM;
> +	}
> +
> +	key.objectid = send_root->objectid;
> +	key.type = BTRFS_ROOT_BACKREF_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot_for_read(send_root->fs_info->tree_root,
> +				&key, path, 1, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	if (key.type != BTRFS_ROOT_BACKREF_KEY ||
> +	    key.objectid != send_root->objectid) {
> +		ret = -ENOENT;
> +		goto out;
> +	}

It looks like we could use a helper for finding the first entry
with a specific objectid+key...

> +	ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
> +	namelen = btrfs_root_ref_name_len(leaf, ref);
> +	read_extent_buffer(leaf, name, (unsigned long)(ref + 1), namelen);
> +	btrfs_release_path(path);
> +
> +	if (ret < 0)
> +		goto out;

How can ret be < 0 here?

> +
> +	if (parent_root) {
> +		ret = begin_cmd(sctx, BTRFS_SEND_C_SNAPSHOT);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ret = begin_cmd(sctx, BTRFS_SEND_C_SUBVOL);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen);

It's called PATH, but it seems to be only the last path component.
What about subvols that are anchored deeper in the dir tree?

> +	TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
> +			sctx->send_root->root_item.uuid);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID,
> +			sctx->send_root->root_item.ctransid);
> +	if (parent_root) {

The name of the parent is not sent?

> +		TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
> +				sctx->parent_root->root_item.uuid);
> +		TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
> +				sctx->parent_root->root_item.ctransid);
> +	}
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	btrfs_free_path(path);
> +	kfree(name);
> +	return ret;
> +}
> +
> +static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
> +{
> +	int ret = 0;
> +	struct fs_path *p;
> +
> +verbose_printk("btrfs: send_truncate %llu size=%llu\n", ino, size);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_TRUNCATE);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, ino, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, size);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int send_chmod(struct send_ctx *sctx, u64 ino, u64 gen, u64 mode)
> +{
> +	int ret = 0;
> +	struct fs_path *p;
> +
> +verbose_printk("btrfs: send_chmod %llu mode=%llu\n", ino, mode);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CHMOD);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, ino, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode & 07777);

four 7s?

> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int send_chown(struct send_ctx *sctx, u64 ino, u64 gen, u64 uid, u64 gid)
> +{
> +	int ret = 0;
> +	struct fs_path *p;
> +
> +verbose_printk("btrfs: send_chown %llu uid=%llu, gid=%llu\n", ino, uid, gid);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CHOWN);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, ino, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_UID, uid);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_GID, gid);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int send_utimes(struct send_ctx *sctx, u64 ino, u64 gen)
> +{
> +	int ret = 0;
> +	struct fs_path *p = NULL;
> +	struct btrfs_inode_item *ii;
> +	struct btrfs_path *path = NULL;
> +	struct extent_buffer *eb;
> +	struct btrfs_key key;
> +	int slot;
> +
> +verbose_printk("btrfs: send_utimes %llu\n", ino);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	path = alloc_path_for_send();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	key.objectid = ino;
> +	key.type = BTRFS_INODE_ITEM_KEY;
> +	key.offset = 0;
> +	ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;

you don't check for existence. I guess you know it exists, otherwise
you wouldn't end up here...

> +
> +	eb = path->nodes[0];
> +	slot = path->slots[0];
> +	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_UTIMES);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, ino, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_ATIME, eb,
> +			btrfs_inode_atime(ii));
> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_MTIME, eb,
> +			btrfs_inode_mtime(ii));
> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_CTIME, eb,
> +			btrfs_inode_ctime(ii));
> +	/* TODO otime? */

yes, please :)

> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * Sends a BTRFS_SEND_C_MKXXX or SYMLINK command to user space. We don't have
> + * a valid path yet because we did not process the refs yet. So, the inode
> + * is created as orphan.
> + */
> +static int send_create_inode(struct send_ctx *sctx, struct btrfs_path *path,
> +			     struct btrfs_key *key)
> +{
> +	int ret = 0;
> +	struct extent_buffer *eb = path->nodes[0];
> +	struct btrfs_inode_item *ii;
> +	struct fs_path *p;
> +	int slot = path->slots[0];
> +	int cmd;
> +	u64 mode;
> +
> +verbose_printk("btrfs: send_create_inode %llu\n", sctx->cur_ino);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
> +	mode = btrfs_inode_mode(eb, ii);
> +
> +	if (S_ISREG(mode))
> +		cmd = BTRFS_SEND_C_MKFILE;
> +	else if (S_ISDIR(mode))
> +		cmd = BTRFS_SEND_C_MKDIR;
> +	else if (S_ISLNK(mode))
> +		cmd = BTRFS_SEND_C_SYMLINK;
> +	else if (S_ISCHR(mode) || S_ISBLK(mode))
> +		cmd = BTRFS_SEND_C_MKNOD;
> +	else if (S_ISFIFO(mode))
> +		cmd = BTRFS_SEND_C_MKFIFO;
> +	else if (S_ISSOCK(mode))
> +		cmd = BTRFS_SEND_C_MKSOCK;
> +	else {

normally you'd put {} on all branches if you need them for one.

> +		printk(KERN_WARNING "btrfs: unexpected inode type %o",
> +				(int)(mode & S_IFMT));
> +		ret = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	ret = begin_cmd(sctx, cmd);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = gen_unique_name(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +
> +	if (S_ISLNK(mode)) {
> +		fs_path_reset(p);
> +		ret = read_symlink(sctx, sctx->send_root, sctx->cur_ino, p);
> +		if (ret < 0)
> +			goto out;
> +		TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, p);
> +	} else if (S_ISCHR(mode) || S_ISBLK(mode) ||
> +		   S_ISFIFO(mode) || S_ISSOCK(mode)) {
> +		TLV_PUT_U64(sctx, BTRFS_SEND_A_RDEV, btrfs_inode_rdev(eb, ii));
> +	}
> +
> +	ret = send_cmd(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +struct recorded_ref {
> +	struct list_head list;
> +	char *dir_path;
> +	char *name;
> +	struct fs_path *full_path;
> +	u64 dir;
> +	u64 dir_gen;
> +	int dir_path_len;
> +	int name_len;
> +};
> +
> +/*
> + * We need to process new refs before deleted refs, but compare_tree gives us
> + * everything mixed. So we first record all refs and later process them.
> + * This function is a helper to record one ref.
> + */
> +static int record_ref(struct list_head *head, u64 dir,
> +		      u64 dir_gen, struct fs_path *path)
> +{
> +	struct recorded_ref *ref;
> +	char *tmp;
> +
> +	ref = kmalloc(sizeof(*ref), GFP_NOFS);
> +	if (!ref)
> +		return -ENOMEM;
> +
> +	ref->dir = dir;
> +	ref->dir_gen = dir_gen;
> +	ref->full_path = path;
> +
> +	tmp = strrchr(ref->full_path->start, '/');
> +	if (!tmp) {
> +		ref->name_len = ref->full_path->end - ref->full_path->start;
> +		ref->name = ref->full_path->start;
> +		ref->dir_path_len = 0;
> +		ref->dir_path = ref->full_path->start;
> +	} else {
> +		tmp++;
> +		ref->name_len = ref->full_path->end - tmp;
> +		ref->name = tmp;
> +		ref->dir_path = ref->full_path->start;
> +		ref->dir_path_len = ref->full_path->end -
> +				ref->full_path->start - 1 - ref->name_len;
> +	}
> +
> +	list_add_tail(&ref->list, head);
> +	return 0;
> +}
> +
> +static void __free_recorded_refs(struct send_ctx *sctx, struct list_head *head)
> +{
> +	struct recorded_ref *cur;
> +	struct recorded_ref *tmp;
> +
> +	list_for_each_entry_safe(cur, tmp, head, list) {
> +		fs_path_free(sctx, cur->full_path);
> +		kfree(cur);
> +	}
> +	INIT_LIST_HEAD(head);

This is a bit non-obvious: you use the _safe macro as if you were
going to delete each entry, but then you never unlink them and
instead just reset the head. I'd prefer an explicit
while (!list_empty())/list_del loop here.

> +}
> +
> +static void free_recorded_refs(struct send_ctx *sctx)
> +{
> +	__free_recorded_refs(sctx, &sctx->new_refs);
> +	__free_recorded_refs(sctx, &sctx->deleted_refs);
> +}
> +
> +/*
> + * Renames/moves a file/dir to it's orphan name. Used when the first
                                  its

> + * ref of an unprocessed inode gets overwritten and for all non empty
> + * directories.
> + */
> +static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen,
> +			  struct fs_path *path)
> +{
> +	int ret;
> +	struct fs_path *orphan;
> +
> +	orphan = fs_path_alloc(sctx);
> +	if (!orphan)
> +		return -ENOMEM;
> +
> +	ret = gen_unique_name(sctx, ino, gen, orphan);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = send_rename(sctx, path, orphan);
> +
> +out:
> +	fs_path_free(sctx, orphan);
> +	return ret;
> +}
> +
> +/*
> + * Returns 1 if a directory can be removed at this point in time.
> + * We check this by iterating all dir items and checking if the inode behind
> + * the dir item was already processed.
> + */
> +static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 send_progress)
> +{
> +	int ret = 0;
> +	struct btrfs_root *root = sctx->parent_root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct btrfs_key loc;
> +	struct btrfs_dir_item *di;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = dir;
> +	key.type = BTRFS_DIR_INDEX_KEY;
> +	key.offset = 0;
> +
> +	while (1) {
> +		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +		if (ret < 0)
> +			goto out;
> +		if (!ret) {
> +			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
> +					path->slots[0]);
> +		}
> +		if (ret || found_key.objectid != key.objectid ||
> +		    found_key.type != key.type) {
> +			break;
> +		}

another case for the above-mentioned helper...

> +
> +		di = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +				struct btrfs_dir_item);
> +		btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc);
> +
> +		if (loc.objectid > send_progress) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		btrfs_release_path(path);
> +		key.offset = found_key.offset + 1;
> +	}
> +
> +	ret = 1;
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * This does all the move/link/unlink/rmdir magic.
> + */
> +static int process_recorded_refs(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +	struct recorded_ref *cur;
> +	struct ulist *check_dirs = NULL;
> +	struct ulist_iterator uit;
> +	struct ulist_node *un;
> +	struct fs_path *valid_path = NULL;
> +	u64 ow_inode;
> +	u64 ow_gen;
> +	int did_overwrite = 0;
> +	int is_orphan = 0;
> +
> +verbose_printk("btrfs: process_recorded_refs %llu\n", sctx->cur_ino);
> +
> +	valid_path = fs_path_alloc(sctx);
> +	if (!valid_path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	check_dirs = ulist_alloc(GFP_NOFS);
> +	if (!check_dirs) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/*
> +	 * First, check if the first ref of the current inode was overwritten
> +	 * before. If yes, we know that the current inode was already orphanized
> +	 * and thus use the orphan name. If not, we can use get_cur_path to
> +	 * get the path of the first ref as it would look while receiving at
> +	 * this point in time.
> +	 * New inodes are always orphan at the beginning, so force to use the
> +	 * orphan name in this case.
> +	 * The first ref is stored in valid_path and will be updated if it
> +	 * gets moved around.
> +	 */
> +	if (!sctx->cur_inode_new) {
> +		ret = did_overwrite_first_ref(sctx, sctx->cur_ino,
> +				sctx->cur_inode_gen);
> +		if (ret < 0)
> +			goto out;
> +		if (ret)
> +			did_overwrite = 1;
> +	}
> +	if (sctx->cur_inode_new || did_overwrite) {
> +		ret = gen_unique_name(sctx, sctx->cur_ino,
> +				sctx->cur_inode_gen, valid_path);
> +		if (ret < 0)
> +			goto out;
> +		is_orphan = 1;
> +	} else {
> +		ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> +				valid_path);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	list_for_each_entry(cur, &sctx->new_refs, list) {
> +		/*
> +		 * Check if this new ref would overwrite the first ref of
> +		 * another unprocessed inode. If yes, orphanize the
> +		 * overwritten inode. If we find an overwritten ref that is
> +		 * not the first ref, simply unlink it.
> +		 */
> +		ret = will_overwrite_ref(sctx, cur->dir, cur->dir_gen,
> +				cur->name, cur->name_len,
> +				&ow_inode, &ow_gen);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = is_first_ref(sctx, sctx->parent_root,
> +					ow_inode, cur->dir, cur->name,
> +					cur->name_len);
> +			if (ret < 0)
> +				goto out;
> +			if (ret) {
> +				ret = orphanize_inode(sctx, ow_inode, ow_gen,
> +						cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +			} else {
> +				ret = send_unlink(sctx, cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +			}
> +		}
> +
> +		/*
> +		 * link/move the ref to the new place. If we have an orphan
> +		 * inode, move it and update valid_path. If not, link or move
> +		 * it depending on the inode mode.
> +		 */
> +		if (is_orphan) {
> +			ret = send_rename(sctx, valid_path, cur->full_path);
> +			if (ret < 0)
> +				goto out;
> +			is_orphan = 0;
> +			ret = fs_path_copy(valid_path, cur->full_path);
> +			if (ret < 0)
> +				goto out;
> +		} else {
> +			if (S_ISDIR(sctx->cur_inode_mode)) {

why not save a level of indentation here by using <else if>?

> +				/*
> +				 * Dirs can't be linked, so move it. For moved
> +				 * dirs, we always have one new and one deleted
> +				 * ref. The deleted ref is ignored later.
> +				 */
> +				ret = send_rename(sctx, valid_path,
> +						cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +				ret = fs_path_copy(valid_path, cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +			} else {
> +				ret = send_link(sctx, valid_path,
> +						cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +			}
> +		}
> +		ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,

careful: ulist's aux is only an unsigned long (pointer-sized), so
storing a u64 generation in it truncates on 32-bit.

> +				GFP_NOFS);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_deleted) {
> +		/*
> +		 * Check if we can already rmdir the directory. If not,
> +		 * orphanize it. For every dir item inside that gets deleted
> +		 * later, we do this check again and rmdir it then if possible.
> +		 * See the use of check_dirs for more details.
> +		 */
> +		ret = can_rmdir(sctx, sctx->cur_ino, sctx->cur_ino);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = send_rmdir(sctx, valid_path);
> +			if (ret < 0)
> +				goto out;
> +		} else if (!is_orphan) {
> +			ret = orphanize_inode(sctx, sctx->cur_ino,
> +					sctx->cur_inode_gen, valid_path);
> +			if (ret < 0)
> +				goto out;
> +			is_orphan = 1;
> +		}
> +
> +		list_for_each_entry(cur, &sctx->deleted_refs, list) {
> +			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
> +					GFP_NOFS);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	} else if (!S_ISDIR(sctx->cur_inode_mode)) {
> +		/*
> +		 * We have a non dir inode. Go through all deleted refs and
> +		 * unlink them if they were not already overwritten by other
> +		 * inodes.
> +		 */
> +		list_for_each_entry(cur, &sctx->deleted_refs, list) {
> +			ret = did_overwrite_ref(sctx, cur->dir, cur->dir_gen,
> +					sctx->cur_ino, sctx->cur_inode_gen,
> +					cur->name, cur->name_len);
> +			if (ret < 0)
> +				goto out;
> +			if (!ret) {
> +				ret = send_unlink(sctx, cur->full_path);
> +				if (ret < 0)
> +					goto out;
> +			}
> +			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
> +					GFP_NOFS);
> +			if (ret < 0)
> +				goto out;
> +		}
> +
> +		/*
> +		 * If the inode is still orphan, unlink the orphan. This may
> +		 * happen when a previous inode did overwrite the first ref
> +		 * of this inode and no new refs were added for the current
> +		 * inode.
> +		 */
> +		if (is_orphan) {
> +			ret = send_unlink(sctx, valid_path);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	}
> +
> +	/*
> +	 * We did collect all parent dirs where cur_inode was once located. We
> +	 * now go through all these dirs and check if they are pending for
> +	 * deletion and if it's finally possible to perform the rmdir now.
> +	 * We also update the inode stats of the parent dirs here.
> +	 */
> +	ULIST_ITER_INIT(&uit);
> +	while ((un = ulist_next(check_dirs, &uit))) {
> +		if (un->val > sctx->cur_ino)
> +			continue;
> +
> +		ret = get_cur_inode_state(sctx, un->val, un->aux);
> +		if (ret < 0)
> +			goto out;
> +
> +		if (ret == inode_state_did_create ||
> +		    ret == inode_state_no_change) {
> +			/* TODO delayed utimes */
> +			ret = send_utimes(sctx, un->val, un->aux);
> +			if (ret < 0)
> +				goto out;
> +		} else if (ret == inode_state_did_delete) {
> +			ret = can_rmdir(sctx, un->val, sctx->cur_ino);
> +			if (ret < 0)
> +				goto out;
> +			if (ret) {
> +				ret = get_cur_path(sctx, un->val, un->aux,
> +						valid_path);
> +				if (ret < 0)
> +					goto out;
> +				ret = send_rmdir(sctx, valid_path);
> +				if (ret < 0)
> +					goto out;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * Current inode is now at it's new position, so we must increase
                                   its
> +	 * send_progress
> +	 */
> +	sctx->send_progress = sctx->cur_ino + 1;

is this the right place for it, or should it be done at the
calling site?

> +
> +	ret = 0;
> +
> +out:
> +	free_recorded_refs(sctx);
> +	ulist_free(check_dirs);
> +	fs_path_free(sctx, valid_path);
> +	return ret;
> +}
> +
> +static int __record_new_ref(int num, u64 dir, int index,
> +			    struct fs_path *name,
> +			    void *ctx)
> +{
> +	int ret = 0;
> +	struct send_ctx *sctx = ctx;
> +	struct fs_path *p;
> +	u64 gen;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = get_inode_info(sctx->send_root, dir, NULL, &gen, NULL, NULL,
> +			NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, dir, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	ret = fs_path_add_path(p, name);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = record_ref(&sctx->new_refs, dir, gen, p);
> +
> +out:
> +	if (ret)
> +		fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int __record_deleted_ref(int num, u64 dir, int index,
> +				struct fs_path *name,
> +				void *ctx)
> +{
> +	int ret = 0;
> +	struct send_ctx *sctx = ctx;
> +	struct fs_path *p;
> +	u64 gen;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = get_inode_info(sctx->parent_root, dir, NULL, &gen, NULL, NULL,
> +			NULL);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, dir, gen, p);
> +	if (ret < 0)
> +		goto out;
> +	ret = fs_path_add_path(p, name);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = record_ref(&sctx->deleted_refs, dir, gen, p);
> +
> +out:
> +	if (ret)
> +		fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int record_new_ref(struct send_ctx *sctx)
> +{
> +	int ret;
> +
> +	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
> +			sctx->cmp_key, 0, __record_new_ref, sctx);
> +
> +	return ret;
> +}
> +
> +static int record_deleted_ref(struct send_ctx *sctx)
> +{
> +	int ret;
> +
> +	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, 0, __record_deleted_ref, sctx);
> +	return ret;
> +}
> +
> +struct find_ref_ctx {
> +	u64 dir;
> +	struct fs_path *name;
> +	int found_idx;
> +};
> +
> +static int __find_iref(int num, u64 dir, int index,
> +		       struct fs_path *name,
> +		       void *ctx_)
> +{
> +	struct find_ref_ctx *ctx = ctx_;
> +
> +	if (dir == ctx->dir && fs_path_len(name) == fs_path_len(ctx->name) &&
> +	    strncmp(name->start, ctx->name->start, fs_path_len(name)) == 0) {
> +		ctx->found_idx = num;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int find_iref(struct send_ctx *sctx,
> +		     struct btrfs_root *root,
> +		     struct btrfs_path *path,
> +		     struct btrfs_key *key,
> +		     u64 dir, struct fs_path *name)
> +{
> +	int ret;
> +	struct find_ref_ctx ctx;
> +
> +	ctx.dir = dir;
> +	ctx.name = name;
> +	ctx.found_idx = -1;
> +
> +	ret = iterate_inode_ref(sctx, root, path, key, 0, __find_iref, &ctx);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ctx.found_idx == -1)
> +		return -ENOENT;
> +
> +	return ctx.found_idx;
> +}
> +
> +static int __record_changed_new_ref(int num, u64 dir, int index,
> +				    struct fs_path *name,
> +				    void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +
> +	ret = find_iref(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, dir, name);
> +	if (ret == -ENOENT)
> +		ret = __record_new_ref(num, dir, index, name, sctx);
> +	else if (ret > 0)
> +		ret = 0;
> +
> +	return ret;
> +}
> +
> +static int __record_changed_deleted_ref(int num, u64 dir, int index,
> +					struct fs_path *name,
> +					void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +
> +	ret = find_iref(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
> +			dir, name);
> +	if (ret == -ENOENT)
> +		ret = __record_deleted_ref(num, dir, index, name, sctx);
> +	else if (ret > 0)
> +		ret = 0;
> +
> +	return ret;
> +}
> +
> +static int record_changed_ref(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +
> +	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
> +			sctx->cmp_key, 0, __record_changed_new_ref, sctx);
> +	if (ret < 0)
> +		goto out;
> +	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, 0, __record_changed_deleted_ref, sctx);
> +
> +out:
> +	return ret;
> +}
> +
> +/*
> + * Record and process all refs at once. Needed when an inode changes the
> + * generation number, which means that it was deleted and recreated.
> + */
> +static int process_all_refs(struct send_ctx *sctx,
> +			    enum btrfs_compare_tree_result cmd)
> +{
> +	int ret;
> +	struct btrfs_root *root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct extent_buffer *eb;
> +	int slot;
> +	iterate_inode_ref_t cb;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	if (cmd == BTRFS_COMPARE_TREE_NEW) {
> +		root = sctx->send_root;
> +		cb = __record_new_ref;
> +	} else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
> +		root = sctx->parent_root;
> +		cb = __record_deleted_ref;
> +	} else {
> +		BUG();
> +	}
> +
> +	key.objectid = sctx->cmp_key->objectid;
> +	key.type = BTRFS_INODE_REF_KEY;
> +	key.offset = 0;
> +	while (1) {
> +		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +		if (ret < 0) {
> +			btrfs_release_path(path);

not needed

> +			goto out;
> +		}
> +		if (ret) {
> +			btrfs_release_path(path);

ditto

> +			break;
> +		}
> +
> +		eb = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(eb, &found_key, slot);
> +
> +		if (found_key.objectid != key.objectid ||
> +		    found_key.type != key.type) {
> +			btrfs_release_path(path);

and here

> +			break;
> +		}

helper :)

> +
> +		ret = iterate_inode_ref(sctx, sctx->parent_root, path,
> +				&found_key, 0, cb, sctx);
> +		btrfs_release_path(path);
> +		if (ret < 0)
> +			goto out;
> +
> +		key.offset = found_key.offset + 1;
> +	}
> +
> +	ret = process_recorded_refs(sctx);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int send_set_xattr(struct send_ctx *sctx,
> +			  struct fs_path *path,
> +			  const char *name, int name_len,
> +			  const char *data, int data_len)
> +{
> +	int ret = 0;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_SET_XATTR);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
> +	TLV_PUT(sctx, BTRFS_SEND_A_XATTR_DATA, data, data_len);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +static int send_remove_xattr(struct send_ctx *sctx,
> +			  struct fs_path *path,
> +			  const char *name, int name_len)
> +{
> +	int ret = 0;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_REMOVE_XATTR);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	return ret;
> +}
> +
> +static int __process_new_xattr(int num, const char *name, int name_len,
> +			       const char *data, int data_len,
> +			       u8 type, void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +	struct fs_path *p;
> +	posix_acl_xattr_header dummy_acl;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	/*
> +	 * This hack is needed because empty acl's are stored as zero byte
> +	 * data in xattrs. Problem with that is, that receiving these zero byte
> +	 * acl's will fail later. To fix this, we send a dummy acl list that
> +	 * only contains the version number and no entries.
> +	 */
> +	if (!strncmp(name, XATTR_NAME_POSIX_ACL_ACCESS, name_len) ||
> +	    !strncmp(name, XATTR_NAME_POSIX_ACL_DEFAULT, name_len)) {
> +		if (data_len == 0) {
> +			dummy_acl.a_version =
> +					cpu_to_le32(POSIX_ACL_XATTR_VERSION);
> +			data = (char *)&dummy_acl;
> +			data_len = sizeof(dummy_acl);
> +		}
> +	}
> +
> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = send_set_xattr(sctx, p, name, name_len, data, data_len);
> +
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int __process_deleted_xattr(int num, const char *name, int name_len,
> +				   const char *data, int data_len,
> +				   u8 type, void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +	struct fs_path *p;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = send_remove_xattr(sctx, p, name, name_len);
> +
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int process_new_xattr(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +
> +	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
> +			sctx->cmp_key, __process_new_xattr, sctx);
> +
> +	return ret;
> +}
> +
> +static int process_deleted_xattr(struct send_ctx *sctx)
> +{
> +	int ret;
> +
> +	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, __process_deleted_xattr, sctx);
> +
> +	return ret;
> +}
> +
> +struct find_xattr_ctx {
> +	const char *name;
> +	int name_len;
> +	int found_idx;
> +	char *found_data;
> +	int found_data_len;
> +};
> +
> +static int __find_xattr(int num, const char *name, int name_len,
> +			const char *data, int data_len,
> +			u8 type, void *vctx)
> +{
> +	struct find_xattr_ctx *ctx = vctx;
> +
> +	if (name_len == ctx->name_len &&
> +	    strncmp(name, ctx->name, name_len) == 0) {
> +		ctx->found_idx = num;
> +		ctx->found_data_len = data_len;
> +		ctx->found_data = kmalloc(data_len, GFP_NOFS);
> +		if (!ctx->found_data)
> +			return -ENOMEM;
> +		memcpy(ctx->found_data, data, data_len);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int find_xattr(struct send_ctx *sctx,
> +		      struct btrfs_root *root,
> +		      struct btrfs_path *path,
> +		      struct btrfs_key *key,
> +		      const char *name, int name_len,
> +		      char **data, int *data_len)
> +{
> +	int ret;
> +	struct find_xattr_ctx ctx;
> +
> +	ctx.name = name;
> +	ctx.name_len = name_len;
> +	ctx.found_idx = -1;
> +	ctx.found_data = NULL;
> +	ctx.found_data_len = 0;
> +
> +	ret = iterate_dir_item(sctx, root, path, key, __find_xattr, &ctx);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ctx.found_idx == -1)
> +		return -ENOENT;
> +	if (data) {
> +		*data = ctx.found_data;
> +		*data_len = ctx.found_data_len;
> +	} else {
> +		kfree(ctx.found_data);
> +	}
> +	return ctx.found_idx;
> +}
> +
> +
> +static int __process_changed_new_xattr(int num, const char *name, int name_len,
> +				       const char *data, int data_len,
> +				       u8 type, void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +	char *found_data = NULL;
> +	int found_data_len  = 0;
> +	struct fs_path *p = NULL;
> +
> +	ret = find_xattr(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, name, name_len, &found_data,
> +			&found_data_len);
> +	if (ret == -ENOENT) {
> +		ret = __process_new_xattr(num, name, name_len, data, data_len,
> +				type, ctx);
> +	} else if (ret >= 0) {
> +		if (data_len != found_data_len ||
> +		    memcmp(data, found_data, data_len)) {
> +			ret = __process_new_xattr(num, name, name_len, data,
> +					data_len, type, ctx);
> +		} else {
> +			ret = 0;
> +		}
> +	}
> +
> +	kfree(found_data);
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int __process_changed_deleted_xattr(int num, const char *name,
> +					   int name_len,
> +					   const char *data, int data_len,
> +					   u8 type, void *ctx)
> +{
> +	int ret;
> +	struct send_ctx *sctx = ctx;
> +
> +	ret = find_xattr(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
> +			name, name_len, NULL, NULL);
> +	if (ret == -ENOENT)
> +		ret = __process_deleted_xattr(num, name, name_len, data,
> +				data_len, type, ctx);
> +	else if (ret >= 0)
> +		ret = 0;
> +
> +	return ret;
> +}
> +
> +static int process_changed_xattr(struct send_ctx *sctx)
> +{
> +	int ret = 0;
> +
> +	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
> +			sctx->cmp_key, __process_changed_new_xattr, sctx);
> +	if (ret < 0)
> +		goto out;
> +	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
> +			sctx->cmp_key, __process_changed_deleted_xattr, sctx);
> +
> +out:
> +	return ret;
> +}
> +
> +static int process_all_new_xattrs(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_root *root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct extent_buffer *eb;
> +	int slot;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	root = sctx->send_root;
> +
> +	key.objectid = sctx->cmp_key->objectid;
> +	key.type = BTRFS_XATTR_ITEM_KEY;
> +	key.offset = 0;
> +	while (1) {
> +		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		eb = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(eb, &found_key, slot);
> +
> +		if (found_key.objectid != key.objectid ||
> +		    found_key.type != key.type) {
> +			ret = 0;
> +			goto out;
> +		}

This search loop (search slot, check objectid/type, process the item,
advance key.offset) appears several times in this file; could it be
factored into a common helper?

> +
> +		ret = iterate_dir_item(sctx, root, path, &found_key,
> +				__process_new_xattr, sctx);
> +		if (ret < 0)
> +			goto out;
> +
> +		btrfs_release_path(path);
> +		key.offset = found_key.offset + 1;
> +	}
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
> + * Read some bytes from the current inode/file and send a write command to
> + * user space.
> + */
> +static int send_write(struct send_ctx *sctx, u64 offset, u32 len)
> +{
> +	int ret = 0;
> +	struct fs_path *p;
> +	loff_t pos = offset;
> +	int readed;
> +	mm_segment_t old_fs;
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	/*
> +	 * vfs normally only accepts user space buffers for security reasons.
> +	 * we only read from the file and also only provide the read_buf buffer
> +	 * to vfs. As this buffer does not come from a user space call, it's
> +	 * ok to temporary allow kernel space buffers.
> +	 */
> +	old_fs = get_fs();
> +	set_fs(KERNEL_DS);
> +
> +verbose_printk("btrfs: send_write offset=%llu, len=%d\n", offset, len);
> +
> +	ret = open_cur_inode_file(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = vfs_read(sctx->cur_inode_filp, sctx->read_buf, len, &pos);
> +	if (ret < 0)
> +		goto out;
> +	readed = ret;

num_read? "readed" is not a word.

> +	if (!readed)
> +		goto out;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
> +	TLV_PUT(sctx, BTRFS_SEND_A_DATA, sctx->read_buf, readed);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	set_fs(old_fs);
> +	if (ret < 0)
> +		return ret;
> +	return readed;
> +}
> +
> +/*
> + * Send a clone command to user space.
> + */
> +static int send_clone(struct send_ctx *sctx,
> +		      u64 offset, u32 len,
> +		      struct clone_root *clone_root)
> +{
> +	int ret = 0;
> +	struct btrfs_root *clone_root2 = clone_root->root;

a name from hell :)

> +	struct fs_path *p;
> +	u64 gen;
> +
> +verbose_printk("btrfs: send_clone offset=%llu, len=%d, clone_root=%llu, "
> +	       "clone_inode=%llu, clone_offset=%llu\n", offset, len,
> +		clone_root->root->objectid, clone_root->ino,
> +		clone_root->offset);
> +
> +	p = fs_path_alloc(sctx);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CLONE);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_LEN, len);
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
> +
> +	if (clone_root2 == sctx->send_root) {
> +		ret = get_inode_info(sctx->send_root, clone_root->ino, NULL,
> +				&gen, NULL, NULL, NULL);
> +		if (ret < 0)
> +			goto out;
> +		ret = get_cur_path(sctx, clone_root->ino, gen, p);
> +	} else {
> +		ret = get_inode_path(sctx, clone_root2, clone_root->ino, p);
> +	}
> +	if (ret < 0)
> +		goto out;
> +
> +	TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
> +			clone_root2->root_item.uuid);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
> +			clone_root2->root_item.ctransid);
> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_CLONE_PATH, p);
> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_OFFSET,
> +			clone_root->offset);
> +
> +	ret = send_cmd(sctx);
> +
> +tlv_put_failure:
> +out:
> +	fs_path_free(sctx, p);
> +	return ret;
> +}
> +
> +static int send_write_or_clone(struct send_ctx *sctx,
> +			       struct btrfs_path *path,
> +			       struct btrfs_key *key,
> +			       struct clone_root *clone_root)
> +{
> +	int ret = 0;
> +	struct btrfs_file_extent_item *ei;
> +	u64 offset = key->offset;
> +	u64 pos = 0;
> +	u64 len;
> +	u32 l;
> +	u8 type;
> +
> +	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +			struct btrfs_file_extent_item);
> +	type = btrfs_file_extent_type(path->nodes[0], ei);
> +	if (type == BTRFS_FILE_EXTENT_INLINE)
> +		len = btrfs_file_extent_inline_len(path->nodes[0], ei);
> +	else
> +		len = btrfs_file_extent_num_bytes(path->nodes[0], ei);

What about BTRFS_FILE_EXTENT_PREALLOC? It doesn't seem to be handled here.

> +
> +	if (offset + len > sctx->cur_inode_size)
> +		len = sctx->cur_inode_size - offset;
> +	if (len == 0) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	if (!clone_root) {
> +		while (pos < len) {
> +			l = len - pos;
> +			if (l > BTRFS_SEND_READ_SIZE)
> +				l = BTRFS_SEND_READ_SIZE;
> +			ret = send_write(sctx, pos + offset, l);
> +			if (ret < 0)
> +				goto out;
> +			if (!ret)
> +				break;
> +			pos += ret;
> +		}
> +		ret = 0;
> +	} else {
> +		ret = send_clone(sctx, offset, len, clone_root);
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int is_extent_unchanged(struct send_ctx *sctx,
> +			       struct btrfs_path *left_path,
> +			       struct btrfs_key *ekey)
> +{
> +	int ret = 0;
> +	struct btrfs_key key;
> +	struct btrfs_path *path = NULL;
> +	struct extent_buffer *eb;
> +	int slot;
> +	struct btrfs_key found_key;
> +	struct btrfs_file_extent_item *ei;
> +	u64 left_disknr;
> +	u64 right_disknr;
> +	u64 left_offset;
> +	u64 right_offset;
> +	u64 left_len;
> +	u64 right_len;
> +	u8 left_type;
> +	u8 right_type;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	eb = left_path->nodes[0];
> +	slot = left_path->slots[0];
> +
> +	ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
> +	left_type = btrfs_file_extent_type(eb, ei);
> +	left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
> +	left_len = btrfs_file_extent_num_bytes(eb, ei);
> +	left_offset = btrfs_file_extent_offset(eb, ei);
> +
> +	if (left_type != BTRFS_FILE_EXTENT_REG) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	key.objectid = ekey->objectid;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = ekey->offset;
> +
> +	while (1) {
> +		ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path,
> +				0, 0);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
> +				path->slots[0]);
> +		if (found_key.objectid != key.objectid ||
> +		    found_key.type != key.type) {
> +			ret = 0;
> +			goto out;
> +		}
> +

Reading out the file extent fields (type, disk_bytenr, num_bytes, offset)
is duplicated for the left and right sides; a small helper would avoid that.

> +		eb = path->nodes[0];
> +		slot = path->slots[0];
> +
> +		ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
> +		right_type = btrfs_file_extent_type(eb, ei);
> +		right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
> +		right_len = btrfs_file_extent_num_bytes(eb, ei);
> +		right_offset = btrfs_file_extent_offset(eb, ei);
> +		btrfs_release_path(path);
> +
> +		if (right_type != BTRFS_FILE_EXTENT_REG) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		if (left_disknr != right_disknr) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		key.offset = found_key.offset + right_len;
> +		if (key.offset >= ekey->offset + left_len) {
> +			ret = 1;
> +			goto out;
> +		}
> +	}
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int process_extent(struct send_ctx *sctx,
> +			  struct btrfs_path *path,
> +			  struct btrfs_key *key)
> +{
> +	int ret = 0;
> +	struct clone_root *found_clone = NULL;
> +
> +	if (S_ISLNK(sctx->cur_inode_mode))
> +		return 0;
> +
> +	if (sctx->parent_root && !sctx->cur_inode_new) {
> +		ret = is_extent_unchanged(sctx, path, key);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +
> +	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
> +			sctx->cur_inode_size, &found_clone);
> +	if (ret != -ENOENT && ret < 0)
> +		goto out;
> +
> +	ret = send_write_or_clone(sctx, path, key, found_clone);
> +
> +out:
> +	return ret;
> +}
> +
> +static int process_all_extents(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_root *root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct extent_buffer *eb;
> +	int slot;
> +
> +	root = sctx->send_root;
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	key.objectid = sctx->cmp_key->objectid;
> +	key.type = BTRFS_EXTENT_DATA_KEY;
> +	key.offset = 0;
> +	while (1) {
> +		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		eb = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(eb, &found_key, slot);
> +
> +		if (found_key.objectid != key.objectid ||
> +		    found_key.type != key.type) {
> +			ret = 0;
> +			goto out;
> +		}
> +
> +		ret = process_extent(sctx, path, &found_key);
> +		if (ret < 0)
> +			goto out;
> +
> +		btrfs_release_path(path);
> +		key.offset = found_key.offset + 1;
> +	}
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int process_recorded_refs_if_needed(struct send_ctx *sctx, int at_end)
> +{
> +	int ret = 0;
> +
> +	if (sctx->cur_ino == 0)
> +		goto out;
> +	if (!at_end && sctx->cur_ino == sctx->cmp_key->objectid &&
> +	    sctx->cmp_key->type <= BTRFS_INODE_REF_KEY)
> +		goto out;
> +	if (list_empty(&sctx->new_refs) && list_empty(&sctx->deleted_refs))
> +		goto out;
> +
> +	ret = process_recorded_refs(sctx);
> +
> +out:
> +	return ret;
> +}
> +
> +static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
> +{
> +	int ret = 0;
> +	u64 left_mode;
> +	u64 left_uid;
> +	u64 left_gid;
> +	u64 right_mode;
> +	u64 right_uid;
> +	u64 right_gid;
> +	int need_chmod = 0;
> +	int need_chown = 0;
> +
> +	ret = process_recorded_refs_if_needed(sctx, at_end);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (sctx->cur_ino == 0 || sctx->cur_inode_deleted)
> +		goto out;
> +	if (!at_end && sctx->cmp_key->objectid == sctx->cur_ino)
> +		goto out;
> +
> +	ret = get_inode_info(sctx->send_root, sctx->cur_ino, NULL, NULL,
> +			&left_mode, &left_uid, &left_gid);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (!S_ISLNK(sctx->cur_inode_mode)) {
> +		if (!sctx->parent_root || sctx->cur_inode_new) {
> +			need_chmod = 1;
> +			need_chown = 1;
> +		} else {
> +			ret = get_inode_info(sctx->parent_root, sctx->cur_ino,
> +					NULL, NULL, &right_mode, &right_uid,
> +					&right_gid);
> +			if (ret < 0)
> +				goto out;
> +
> +			if (left_uid != right_uid || left_gid != right_gid)
> +				need_chown = 1;
> +			if (left_mode != right_mode)
> +				need_chmod = 1;
> +		}
> +	}
> +
> +	if (S_ISREG(sctx->cur_inode_mode)) {
> +		ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> +				sctx->cur_inode_size);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	if (need_chown) {
> +		ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> +				left_uid, left_gid);
> +		if (ret < 0)
> +			goto out;
> +	}
> +	if (need_chmod) {
> +		ret = send_chmod(sctx, sctx->cur_ino, sctx->cur_inode_gen,
> +				left_mode);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	/*
> +	 * Need to send that every time, no matter if it actually changed
> +	 * between the two trees as we have done changes to the inode before.
> +	 */
> +	ret = send_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen);
> +	if (ret < 0)
> +		goto out;
> +
> +out:
> +	return ret;
> +}
> +
> +static int changed_inode(struct send_ctx *sctx,
> +			 enum btrfs_compare_tree_result result)
> +{
> +	int ret = 0;
> +	struct btrfs_key *key = sctx->cmp_key;
> +	struct btrfs_inode_item *left_ii = NULL;
> +	struct btrfs_inode_item *right_ii = NULL;
> +	u64 left_gen = 0;
> +	u64 right_gen = 0;
> +
> +	ret = close_cur_inode_file(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	sctx->cur_ino = key->objectid;
> +	sctx->cur_inode_new_gen = 0;
> +	sctx->send_progress = sctx->cur_ino;
> +
> +	if (result == BTRFS_COMPARE_TREE_NEW ||
> +	    result == BTRFS_COMPARE_TREE_CHANGED) {
> +		left_ii = btrfs_item_ptr(sctx->left_path->nodes[0],
> +				sctx->left_path->slots[0],
> +				struct btrfs_inode_item);
> +		left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
> +				left_ii);
> +	} else {
> +		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
> +				sctx->right_path->slots[0],
> +				struct btrfs_inode_item);
> +		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
> +				right_ii);
> +	}
> +	if (result == BTRFS_COMPARE_TREE_CHANGED) {
> +		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
> +				sctx->right_path->slots[0],
> +				struct btrfs_inode_item);
> +
> +		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
> +				right_ii);
> +		if (left_gen != right_gen)
> +			sctx->cur_inode_new_gen = 1;
> +	}
> +
> +	if (result == BTRFS_COMPARE_TREE_NEW) {
> +		sctx->cur_inode_gen = left_gen;
> +		sctx->cur_inode_new = 1;
> +		sctx->cur_inode_deleted = 0;
> +		sctx->cur_inode_size = btrfs_inode_size(
> +				sctx->left_path->nodes[0], left_ii);
> +		sctx->cur_inode_mode = btrfs_inode_mode(
> +				sctx->left_path->nodes[0], left_ii);
> +		if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
> +			ret = send_create_inode(sctx, sctx->left_path,
> +					sctx->cmp_key);
> +	} else if (result == BTRFS_COMPARE_TREE_DELETED) {
> +		sctx->cur_inode_gen = right_gen;
> +		sctx->cur_inode_new = 0;
> +		sctx->cur_inode_deleted = 1;
> +		sctx->cur_inode_size = btrfs_inode_size(
> +				sctx->right_path->nodes[0], right_ii);
> +		sctx->cur_inode_mode = btrfs_inode_mode(
> +				sctx->right_path->nodes[0], right_ii);
> +	} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
> +		if (sctx->cur_inode_new_gen) {
> +			sctx->cur_inode_gen = right_gen;
> +			sctx->cur_inode_new = 0;
> +			sctx->cur_inode_deleted = 1;
> +			sctx->cur_inode_size = btrfs_inode_size(
> +					sctx->right_path->nodes[0], right_ii);
> +			sctx->cur_inode_mode = btrfs_inode_mode(
> +					sctx->right_path->nodes[0], right_ii);
> +			ret = process_all_refs(sctx,
> +					BTRFS_COMPARE_TREE_DELETED);
> +			if (ret < 0)
> +				goto out;
> +
> +			sctx->cur_inode_gen = left_gen;
> +			sctx->cur_inode_new = 1;
> +			sctx->cur_inode_deleted = 0;
> +			sctx->cur_inode_size = btrfs_inode_size(
> +					sctx->left_path->nodes[0], left_ii);
> +			sctx->cur_inode_mode = btrfs_inode_mode(
> +					sctx->left_path->nodes[0], left_ii);
> +			ret = send_create_inode(sctx, sctx->left_path,
> +					sctx->cmp_key);
> +			if (ret < 0)
> +				goto out;
> +
> +			ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW);
> +			if (ret < 0)
> +				goto out;
> +			ret = process_all_extents(sctx);
> +			if (ret < 0)
> +				goto out;
> +			ret = process_all_new_xattrs(sctx);
> +			if (ret < 0)
> +				goto out;
> +		} else {
> +			sctx->cur_inode_gen = left_gen;
> +			sctx->cur_inode_new = 0;
> +			sctx->cur_inode_new_gen = 0;
> +			sctx->cur_inode_deleted = 0;
> +			sctx->cur_inode_size = btrfs_inode_size(
> +					sctx->left_path->nodes[0], left_ii);
> +			sctx->cur_inode_mode = btrfs_inode_mode(
> +					sctx->left_path->nodes[0], left_ii);
> +		}
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +static int changed_ref(struct send_ctx *sctx,
> +		       enum btrfs_compare_tree_result result)
> +{
> +	int ret = 0;
> +
> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
> +
> +	if (!sctx->cur_inode_new_gen &&
> +	    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) {
> +		if (result == BTRFS_COMPARE_TREE_NEW)
> +			ret = record_new_ref(sctx);
> +		else if (result == BTRFS_COMPARE_TREE_DELETED)
> +			ret = record_deleted_ref(sctx);
> +		else if (result == BTRFS_COMPARE_TREE_CHANGED)
> +			ret = record_changed_ref(sctx);
> +	}
> +
> +	return ret;
> +}
> +
> +static int changed_xattr(struct send_ctx *sctx,
> +			 enum btrfs_compare_tree_result result)
> +{
> +	int ret = 0;
> +
> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
> +
> +	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
> +		if (result == BTRFS_COMPARE_TREE_NEW)
> +			ret = process_new_xattr(sctx);
> +		else if (result == BTRFS_COMPARE_TREE_DELETED)
> +			ret = process_deleted_xattr(sctx);
> +		else if (result == BTRFS_COMPARE_TREE_CHANGED)
> +			ret = process_changed_xattr(sctx);
> +	}
> +
> +	return ret;
> +}
> +
> +static int changed_extent(struct send_ctx *sctx,
> +			  enum btrfs_compare_tree_result result)
> +{
> +	int ret = 0;
> +
> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
> +
> +	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
> +		if (result != BTRFS_COMPARE_TREE_DELETED)
> +			ret = process_extent(sctx, sctx->left_path,
> +					sctx->cmp_key);
> +	}
> +
> +	return ret;
> +}
> +
> +
> +static int changed_cb(struct btrfs_root *left_root,
> +		      struct btrfs_root *right_root,
> +		      struct btrfs_path *left_path,
> +		      struct btrfs_path *right_path,
> +		      struct btrfs_key *key,
> +		      enum btrfs_compare_tree_result result,
> +		      void *ctx)
> +{
> +	int ret = 0;
> +	struct send_ctx *sctx = ctx;
> +
> +	sctx->left_path = left_path;
> +	sctx->right_path = right_path;
> +	sctx->cmp_key = key;
> +
> +	ret = finish_inode_if_needed(sctx, 0);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (key->type == BTRFS_INODE_ITEM_KEY)
> +		ret = changed_inode(sctx, result);
> +	else if (key->type == BTRFS_INODE_REF_KEY)
> +		ret = changed_ref(sctx, result);
> +	else if (key->type == BTRFS_XATTR_ITEM_KEY)
> +		ret = changed_xattr(sctx, result);
> +	else if (key->type == BTRFS_EXTENT_DATA_KEY)
> +		ret = changed_extent(sctx, result);
> +
> +out:
> +	return ret;
> +}
> +
> +static int full_send_tree(struct send_ctx *sctx)
> +{
> +	int ret;
> +	struct btrfs_trans_handle *trans = NULL;
> +	struct btrfs_root *send_root = sctx->send_root;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct btrfs_path *path;
> +	struct extent_buffer *eb;
> +	int slot;
> +	u64 start_ctransid;
> +	u64 ctransid;
> +
> +	path = alloc_path_for_send();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	spin_lock(&send_root->root_times_lock);
> +	start_ctransid = btrfs_root_ctransid(&send_root->root_item);
> +	spin_unlock(&send_root->root_times_lock);
> +
> +	key.objectid = BTRFS_FIRST_FREE_OBJECTID;
> +	key.type = BTRFS_INODE_ITEM_KEY;
> +	key.offset = 0;
> +
> +join_trans:
> +	/*
> +	 * We need to make sure the transaction does not get committed
> +	 * while we do anything on commit roots. Join a transaction to prevent
> +	 * this.
> +	 */
> +	trans = btrfs_join_transaction(send_root);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		trans = NULL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Make sure the tree has not changed
> +	 */
> +	spin_lock(&send_root->root_times_lock);
> +	ctransid = btrfs_root_ctransid(&send_root->root_item);
> +	spin_unlock(&send_root->root_times_lock);
> +
> +	if (ctransid != start_ctransid) {
> +		WARN(1, KERN_WARNING "btrfs: the root that you're trying to "
> +				     "send was modified in between. This is "
> +				     "probably a bug.\n");

What is the purpose of getting the ctransid outside the
transaction anyway?

> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret)
> +		goto out_finish;
> +
> +	while (1) {
> +		/*
> +		 * When someone wants to commit while we iterate, end the
> +		 * joined transaction and rejoin.
> +		 */
> +		if (btrfs_should_end_transaction(trans, send_root)) {
> +			ret = btrfs_end_transaction(trans, send_root);
> +			trans = NULL;
> +			if (ret < 0)
> +				goto out;
> +			btrfs_release_path(path);
> +			goto join_trans;
> +		}
> +
> +		eb = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(eb, &found_key, slot);
> +
> +		ret = changed_cb(send_root, NULL, path, NULL,
> +				&found_key, BTRFS_COMPARE_TREE_NEW, sctx);
> +		if (ret < 0)
> +			goto out;
> +
> +		key.objectid = found_key.objectid;
> +		key.type = found_key.type;
> +		key.offset = found_key.offset + 1;

shouldn't this just be before the goto join_trans?

> +
> +		ret = btrfs_next_item(send_root, path);
> +		if (ret < 0)
> +			goto out;
> +		if (ret) {
> +			ret = 0;
> +			break;
> +		}
> +	}
> +
> +out_finish:
> +	ret = finish_inode_if_needed(sctx, 1);
> +
> +out:
> +	btrfs_free_path(path);
> +	if (trans) {
> +		if (!ret)
> +			ret = btrfs_end_transaction(trans, send_root);
> +		else
> +			btrfs_end_transaction(trans, send_root);
> +	}
> +	return ret;
> +}
> +
> +static int send_subvol(struct send_ctx *sctx)
> +{
> +	int ret;
> +
> +	ret = send_header(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = send_subvol_begin(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (sctx->parent_root) {
> +		ret = btrfs_compare_trees(sctx->send_root, sctx->parent_root,
> +				changed_cb, sctx);
> +		if (ret < 0)
> +			goto out;
> +		ret = finish_inode_if_needed(sctx, 1);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ret = full_send_tree(sctx);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +out:
> +	if (!ret)
> +		ret = close_cur_inode_file(sctx);
> +	else
> +		close_cur_inode_file(sctx);
> +
> +	free_recorded_refs(sctx);
> +	return ret;
> +}
> +
> +long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
> +{
> +	int ret = 0;
> +	struct btrfs_root *send_root;
> +	struct btrfs_root *clone_root;
> +	struct btrfs_fs_info *fs_info;
> +	struct btrfs_ioctl_send_args *arg = NULL;
> +	struct btrfs_key key;
> +	struct file *filp = NULL;
> +	struct send_ctx *sctx = NULL;
> +	u32 i;
> +	u64 *clone_sources_tmp = NULL;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	send_root = BTRFS_I(fdentry(mnt_file)->d_inode)->root;
> +	fs_info = send_root->fs_info;
> +
> +	arg = memdup_user(arg_, sizeof(*arg));
> +	if (IS_ERR(arg)) {
> +		ret = PTR_ERR(arg);
> +		arg = NULL;
> +		goto out;
> +	}
> +
> +	if (!access_ok(VERIFY_READ, arg->clone_sources,
> +			sizeof(*arg->clone_sources) *
> +			arg->clone_sources_count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	sctx = kzalloc(sizeof(struct send_ctx), GFP_NOFS);
> +	if (!sctx) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	INIT_LIST_HEAD(&sctx->new_refs);
> +	INIT_LIST_HEAD(&sctx->deleted_refs);
> +	INIT_RADIX_TREE(&sctx->name_cache, GFP_NOFS);
> +	INIT_LIST_HEAD(&sctx->name_cache_list);
> +
> +	sctx->send_filp = fget(arg->send_fd);
> +	if (IS_ERR(sctx->send_filp)) {
> +		ret = PTR_ERR(sctx->send_filp);
> +		goto out;
> +	}
> +
> +	sctx->mnt = mnt_file->f_path.mnt;
> +
> +	sctx->send_root = send_root;
> +	sctx->clone_roots_cnt = arg->clone_sources_count;
> +
> +	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
> +	sctx->send_buf = vmalloc(sctx->send_max_size);
> +	if (!sctx->send_buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
> +	if (!sctx->read_buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	sctx->clone_roots = vzalloc(sizeof(struct clone_root) *
> +			(arg->clone_sources_count + 1));
> +	if (!sctx->clone_roots) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	if (arg->clone_sources_count) {
> +		clone_sources_tmp = vmalloc(arg->clone_sources_count *
> +				sizeof(*arg->clone_sources));
> +		if (!clone_sources_tmp) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
> +				arg->clone_sources_count *
> +				sizeof(*arg->clone_sources));
> +		if (ret) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		for (i = 0; i < arg->clone_sources_count; i++) {
> +			key.objectid = clone_sources_tmp[i];
> +			key.type = BTRFS_ROOT_ITEM_KEY;
> +			key.offset = (u64)-1;
> +			clone_root = btrfs_read_fs_root_no_name(fs_info, &key);
> +			if (!clone_root) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +			if (IS_ERR(clone_root)) {
> +				ret = PTR_ERR(clone_root);
> +				goto out;
> +			}
> +			sctx->clone_roots[i].root = clone_root;
> +		}
> +		vfree(clone_sources_tmp);
> +		clone_sources_tmp = NULL;
> +	}
> +
> +	if (arg->parent_root) {
> +		key.objectid = arg->parent_root;
> +		key.type = BTRFS_ROOT_ITEM_KEY;
> +		key.offset = (u64)-1;
> +		sctx->parent_root = btrfs_read_fs_root_no_name(fs_info, &key);
> +		if (!sctx->parent_root) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * Clones from send_root are allowed, but only if the clone source
> +	 * is behind the current send position. This is checked while searching
> +	 * for possible clone sources.
> +	 */
> +	sctx->clone_roots[sctx->clone_roots_cnt++].root = sctx->send_root;
> +
> +	/* We do a bsearch later */
> +	sort(sctx->clone_roots, sctx->clone_roots_cnt,
> +			sizeof(*sctx->clone_roots), __clone_root_cmp_sort,
> +			NULL);
> +
> +	ret = send_subvol(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = begin_cmd(sctx, BTRFS_SEND_C_END);
> +	if (ret < 0)
> +		goto out;
> +	ret = send_cmd(sctx);
> +	if (ret < 0)
> +		goto out;
> +
> +out:
> +	if (filp)
> +		fput(filp);
> +	kfree(arg);
> +	vfree(clone_sources_tmp);
> +
> +	if (sctx) {
> +		if (sctx->send_filp)
> +			fput(sctx->send_filp);
> +
> +		vfree(sctx->clone_roots);
> +		vfree(sctx->send_buf);
> +		vfree(sctx->read_buf);
> +
> +		name_cache_free(sctx);
> +
> +		kfree(sctx);
> +	}
> +
> +	return ret;
> +}
> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
> index a4c23ee..53f8ee7 100644
> --- a/fs/btrfs/send.h
> +++ b/fs/btrfs/send.h
> @@ -124,3 +124,7 @@ enum {
>  	__BTRFS_SEND_A_MAX,
>  };
>  #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
> +
> +#ifdef __KERNEL__
> +long btrfs_ioctl_send(struct file *mnt_file, void __user *arg);
> +#endif


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
  2012-07-10 15:26   ` Alex Lyakas
  2012-07-23 11:16   ` Arne Jansen
@ 2012-07-23 15:17   ` Alex Lyakas
  2012-08-01 12:54     ` Alexander Block
  2 siblings, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-23 15:17 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Hi Alexander,
I did some testing of the case where same inode, but with a different
generation, exists both in send_root and in parent_root.
I know that this can happen primarily when "inode_cache" option is
enabled. So first I just tested some differential sends, where parent
and root are unrelated subvolumes. Here are some issues:

1) The top subvolume inode (ino=BTRFS_FIRST_FREE_OBJECTID) is treated
also as deleted + recreated. So the code goes into process_all_refs()
path and does several strange things, such as trying to orphanize the
top inode. Also get_cur_path() always returns "" for the top subvolume
(without checking whether it is an orphan).  Another complication for
the top inode is that its parent dir is itself.
I made the following fix:
@@ -3782,7 +3972,13 @@ static int changed_inode(struct send_ctx *sctx,

                right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
                                right_ii);
-               if (left_gen != right_gen)
+               if (left_gen != right_gen && sctx->cur_ino !=
BTRFS_FIRST_FREE_OBJECTID)
                        sctx->cur_inode_new_gen = 1;

So basically, don't try to delete and re-create it, but treat it like
a change. Since the top subvolume inode is S_IFDIR, and dir can have
only one hardlink (and hopefully it is always ".."), we will never
need to change anything for this INODE_REF. I also added:

@@ -2526,6 +2615,14 @@ static int process_recorded_refs(struct send_ctx *sctx)
        int did_overwrite = 0;
        int is_orphan = 0;

+       BUG_ON(sctx->cur_ino <= BTRFS_FIRST_FREE_OBJECTID);

2) After I fixed this, I hit another issue, where inodes under the top
subvolume dir, attempt to rmdir() the top dir, while iterating over
check_dirs in process_recorded_refs(), because (top_dir_ino,
top_dir_gen) indicate that it was deleted. So I added:

@@ -2714,10 +2857,19 @@ verbose_printk("btrfs: process_recorded_refs
%llu\n", sctx->cur_ino);
         */
        ULIST_ITER_INIT(&uit);
        while ((un = ulist_next(check_dirs, &uit))) {
+               /* Do not attempt to rmdir the top subvolume dir */
+               if (un->val == BTRFS_FIRST_FREE_OBJECTID)
+                       continue;
+
                if (un->val > sctx->cur_ino)
                        continue;

3) process_recorded_refs() always increments the send_progress:
	/*
	 * Current inode is now at its new position, so we must increase
	 * send_progress
	 */
	sctx->send_progress = sctx->cur_ino + 1;

However, in the changed_inode() path I am testing, process_all_refs()
is called twice with the same inode (once for deleted inode, once for
the recreated inode), so after the first call, send_progress is
incremented and doesn't match the inode anymore. I don't think I hit
any issues because of this, just that it's confusing.

4)

> +/*
> + * Record and process all refs at once. Needed when an inode changes the
> + * generation number, which means that it was deleted and recreated.
> + */
> +static int process_all_refs(struct send_ctx *sctx,
> +                           enum btrfs_compare_tree_result cmd)
> +{
> +       int ret;
> +       struct btrfs_root *root;
> +       struct btrfs_path *path;
> +       struct btrfs_key key;
> +       struct btrfs_key found_key;
> +       struct extent_buffer *eb;
> +       int slot;
> +       iterate_inode_ref_t cb;
> +
> +       path = alloc_path_for_send();
> +       if (!path)
> +               return -ENOMEM;
> +
> +       if (cmd == BTRFS_COMPARE_TREE_NEW) {
> +               root = sctx->send_root;
> +               cb = __record_new_ref;
> +       } else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
> +               root = sctx->parent_root;
> +               cb = __record_deleted_ref;
> +       } else {
> +               BUG();
> +       }
> +
> +       key.objectid = sctx->cmp_key->objectid;
> +       key.type = BTRFS_INODE_REF_KEY;
> +       key.offset = 0;
> +       while (1) {
> +               ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
> +               if (ret < 0) {
> +                       btrfs_release_path(path);
> +                       goto out;
> +               }
> +               if (ret) {
> +                       btrfs_release_path(path);
> +                       break;
> +               }
> +
> +               eb = path->nodes[0];
> +               slot = path->slots[0];
> +               btrfs_item_key_to_cpu(eb, &found_key, slot);
> +
> +               if (found_key.objectid != key.objectid ||
> +                   found_key.type != key.type) {
> +                       btrfs_release_path(path);
> +                       break;
> +               }
> +
> +               ret = iterate_inode_ref(sctx, sctx->parent_root, path,
> +                               &found_key, 0, cb, sctx);

Shouldn't it be the root that you calculated earlier rather than
sctx->parent_root? I guess in this case it doesn't matter, because
"resolve" is 0 and the passed root is only used for resolving, but it
is still confusing.

5) When I started testing with "inode_cache" enabled, I hit another
issue. With this mount option enabled, FREE_INO and FREE_SPACE items
appear in the file tree. As a result, the code tries to create the
FREE_INO item under an orphan name, then tries to find its INODE_REF,
but fails because it has no INODE_REFs. So

@@ -3923,6 +4127,13 @@ static int changed_cb(struct btrfs_root *left_root,
        int ret = 0;
        struct send_ctx *sctx = ctx;

+       /* Ignore non-FS objects */
+       if (key->objectid == BTRFS_FREE_INO_OBJECTID ||
+               key->objectid == BTRFS_FREE_SPACE_OBJECTID)
+               return 0;

makes sense?

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-23 11:16   ` Arne Jansen
@ 2012-07-23 15:28     ` Alex Lyakas
  2012-07-28 13:49     ` Alexander Block
  1 sibling, 0 replies; 43+ messages in thread
From: Alex Lyakas @ 2012-07-23 15:28 UTC (permalink / raw)
  To: Arne Jansen; +Cc: Alexander Block, linux-btrfs

Hi Arne,

(please don't take this as if I claim to understand the code better
than you do; I have a list of questions for Alexander too).

>> +/*
>> + * Helper function to generate a file name that is unique in the root of
>> + * send_root and parent_root. This is used to generate names for orphan inodes.
>> + */
>> +static int gen_unique_name(struct send_ctx *sctx,
>> +                        u64 ino, u64 gen,
>> +                        struct fs_path *dest)
>> +{
>> +     int ret = 0;
>> +     struct btrfs_path *path;
>> +     struct btrfs_dir_item *di;
>> +     char tmp[64];
>> +     int len;
>> +     u64 idx = 0;
>> +
>> +     path = alloc_path_for_send();
>> +     if (!path)
>> +             return -ENOMEM;
>> +
>> +     while (1) {
>> +             len = snprintf(tmp, sizeof(tmp) - 1, "o%llu-%llu-%llu",
>> +                             ino, gen, idx);
>
> wouldn't it be easier to just take a uuid? This would save you a lot
> of code and especially the need to verify that the name is really
> unique, saving seeks.

As far as I understand the logic of orphans, the unique name must
depend only on the send_root and parent_root contents, which are both
frozen. So when you re-generate the name for a particular (ino,gen),
you must get exactly the same name every time. If the user happens to
have oXXX-YY-Z file(s) in the top dir by accident, they are the same
every time the orphan name is recalculated, so the result is still
deterministic. Does that make sense?
So did you mean to generate a uuid here, and save it somewhere
in-memory, and later look it up based on (ino,gen)? Or you mean some
other improvement?

Thanks,
Alex.


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-16 14:56   ` Arne Jansen
@ 2012-07-23 19:41     ` Alexander Block
  2012-07-24  5:55       ` Arne Jansen
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Block @ 2012-07-23 19:41 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs

On Mon, Jul 16, 2012 at 4:56 PM, Arne Jansen <sensille@gmx.net> wrote:
> On 04.07.2012 15:38, Alexander Block wrote:
>> This patch introduces uuids for subvolumes. Each
>> subvolume has its own uuid. In case it was snapshotted,
>> it also contains parent_uuid. In case it was received,
>> it also contains received_uuid.
>>
>> It also introduces subvolume ctime/otime/stime/rtime. The
>> first two are comparable to the times found in inodes. otime
>> is the origin/creation time and ctime is the change time.
>> stime/rtime are only valid on received subvolumes.
>> stime is the time of the subvolume when it was
>> sent. rtime is the time of the subvolume when it was
>> received.
>>
>> In addition to the times, we have a transid for each
>> time. They are updated at the same place as the times.
>>
>> btrfs receive uses stransid and rtransid to find out
>> if a received subvolume changed in the meantime.
>>
>> If an older kernel mounts a filesystem with the
>> extended fields, all fields become invalid. The next
>> mount with a new kernel will detect this and reset the
>> fields.
>>
>> Signed-off-by: Alexander Block <ablock84@googlemail.com>
>> ---
>>  fs/btrfs/ctree.h       |   43 ++++++++++++++++++++++
>>  fs/btrfs/disk-io.c     |    2 +
>>  fs/btrfs/inode.c       |    4 ++
>>  fs/btrfs/ioctl.c       |   96 ++++++++++++++++++++++++++++++++++++++++++++++--
>>  fs/btrfs/ioctl.h       |   13 +++++++
>>  fs/btrfs/root-tree.c   |   92 +++++++++++++++++++++++++++++++++++++++++++---
>>  fs/btrfs/transaction.c |   17 +++++++++
>>  7 files changed, 258 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 8cfde93..2bd5df8 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -709,6 +709,35 @@ struct btrfs_root_item {
>>       struct btrfs_disk_key drop_progress;
>>       u8 drop_level;
>>       u8 level;
>> +
>> +     /*
>> +      * The following fields appear after subvol_uuids+subvol_times
>> +      * were introduced.
>> +      */
>> +
>> +     /*
>> +      * This generation number is used to test if the new fields are valid
>> +      * and up to date while reading the root item. Everytime the root item
>> +      * is written out, the "generation" field is copied into this field. If
>> +      * anyone ever mounted the fs with an older kernel, we will have
>> +      * mismatching generation values here and thus must invalidate the
>> +      * new fields. See btrfs_update_root and btrfs_find_last_root for
>> +      * details.
>> +      * the offset of generation_v2 is also used as the start for the memset
>> +      * when invalidating the fields.
>> +      */
>> +     __le64 generation_v2;
>> +     u8 uuid[BTRFS_UUID_SIZE];
>> +     u8 parent_uuid[BTRFS_UUID_SIZE];
>> +     u8 received_uuid[BTRFS_UUID_SIZE];
>> +     __le64 ctransid; /* updated when an inode changes */
>> +     __le64 otransid; /* trans when created */
>> +     __le64 stransid; /* trans when sent. non-zero for received subvol */
>> +     __le64 rtransid; /* trans when received. non-zero for received subvol */
>> +     struct btrfs_timespec ctime;
>> +     struct btrfs_timespec otime;
>> +     struct btrfs_timespec stime;
>> +     struct btrfs_timespec rtime;
>>  } __attribute__ ((__packed__));
>>
>>  /*
>> @@ -1416,6 +1445,8 @@ struct btrfs_root {
>>       dev_t anon_dev;
>>
>>       int force_cow;
>> +
>> +     spinlock_t root_times_lock;
>>  };
>>
>>  struct btrfs_ioctl_defrag_range_args {
>> @@ -2189,6 +2220,16 @@ BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
>>  BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
>>  BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
>>                        last_snapshot, 64);
>> +BTRFS_SETGET_STACK_FUNCS(root_generation_v2, struct btrfs_root_item,
>> +                      generation_v2, 64);
>> +BTRFS_SETGET_STACK_FUNCS(root_ctransid, struct btrfs_root_item,
>> +                      ctransid, 64);
>> +BTRFS_SETGET_STACK_FUNCS(root_otransid, struct btrfs_root_item,
>> +                      otransid, 64);
>> +BTRFS_SETGET_STACK_FUNCS(root_stransid, struct btrfs_root_item,
>> +                      stransid, 64);
>> +BTRFS_SETGET_STACK_FUNCS(root_rtransid, struct btrfs_root_item,
>> +                      rtransid, 64);
>>
>>  static inline bool btrfs_root_readonly(struct btrfs_root *root)
>>  {
>> @@ -2829,6 +2870,8 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
>>  void btrfs_set_root_node(struct btrfs_root_item *item,
>>                        struct extent_buffer *node);
>>  void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
>> +void btrfs_update_root_times(struct btrfs_trans_handle *trans,
>> +                          struct btrfs_root *root);
>>
>>  /* dir-item.c */
>>  int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7b845ff..d3b49ad 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1182,6 +1182,8 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>>       root->defrag_running = 0;
>>       root->root_key.objectid = objectid;
>>       root->anon_dev = 0;
>> +
>> +     spin_lock_init(&root->root_times_lock);
>>  }
>>
>>  static int __must_check find_and_setup_root(struct btrfs_root *tree_root,
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 139be17..0f6a65d 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -2734,6 +2734,8 @@ noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
>>        */
>>       if (!btrfs_is_free_space_inode(root, inode)
>>           && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>> +             btrfs_update_root_times(trans, root);
>> +
>>               ret = btrfs_delayed_update_inode(trans, root, inode);
>>               if (!ret)
>>                       btrfs_set_inode_last_trans(trans, inode);
>> @@ -4728,6 +4730,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>>       trace_btrfs_inode_new(inode);
>>       btrfs_set_inode_last_trans(trans, inode);
>>
>> +     btrfs_update_root_times(trans, root);
>> +
>>       return inode;
>>  fail:
>>       if (dir)
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7011871..8d258cb 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/vmalloc.h>
>>  #include <linux/slab.h>
>>  #include <linux/blkdev.h>
>> +#include <linux/uuid.h>
>>  #include "compat.h"
>>  #include "ctree.h"
>>  #include "disk-io.h"
>> @@ -346,11 +347,13 @@ static noinline int create_subvol(struct btrfs_root *root,
>>       struct btrfs_root *new_root;
>>       struct dentry *parent = dentry->d_parent;
>>       struct inode *dir;
>> +     struct timespec cur_time = CURRENT_TIME;
>>       int ret;
>>       int err;
>>       u64 objectid;
>>       u64 new_dirid = BTRFS_FIRST_FREE_OBJECTID;
>>       u64 index = 0;
>> +     uuid_le new_uuid;
>>
>>       ret = btrfs_find_free_objectid(root->fs_info->tree_root, &objectid);
>>       if (ret)
>> @@ -389,8 +392,9 @@ static noinline int create_subvol(struct btrfs_root *root,
>>                           BTRFS_UUID_SIZE);
>>       btrfs_mark_buffer_dirty(leaf);
>>
>> +     memset(&root_item, 0, sizeof(root_item));
>> +
>>       inode_item = &root_item.inode;
>> -     memset(inode_item, 0, sizeof(*inode_item));
>>       inode_item->generation = cpu_to_le64(1);
>>       inode_item->size = cpu_to_le64(3);
>>       inode_item->nlink = cpu_to_le32(1);
>> @@ -408,8 +412,15 @@ static noinline int create_subvol(struct btrfs_root *root,
>>       btrfs_set_root_used(&root_item, leaf->len);
>>       btrfs_set_root_last_snapshot(&root_item, 0);
>>
>> -     memset(&root_item.drop_progress, 0, sizeof(root_item.drop_progress));
>> -     root_item.drop_level = 0;
>> +     btrfs_set_root_generation_v2(&root_item,
>> +                     btrfs_root_generation(&root_item));
>> +     uuid_le_gen(&new_uuid);
>> +     memcpy(root_item.uuid, new_uuid.b, BTRFS_UUID_SIZE);
>> +     root_item.otime.sec = cpu_to_le64(cur_time.tv_sec);
>> +     root_item.otime.nsec = cpu_to_le64(cur_time.tv_nsec);
>> +     root_item.ctime = root_item.otime;
>> +     btrfs_set_root_ctransid(&root_item, trans->transid);
>> +     btrfs_set_root_otransid(&root_item, trans->transid);
>>
>>       btrfs_tree_unlock(leaf);
>>       free_extent_buffer(leaf);
>> @@ -3395,6 +3406,83 @@ out:
>>       return ret;
>>  }
>>
>> +static long btrfs_ioctl_set_received_subvol(struct file *file,
>> +                                         void __user *arg)
>> +{
>> +     struct btrfs_ioctl_received_subvol_args *sa = NULL;
>> +     struct inode *inode = fdentry(file)->d_inode;
>> +     struct btrfs_root *root = BTRFS_I(inode)->root;
>> +     struct btrfs_root_item *root_item = &root->root_item;
>> +     struct btrfs_trans_handle *trans;
>> +     int ret = 0;
>> +
>> +     ret = mnt_want_write_file(file);
>> +     if (ret < 0)
>> +             return ret;
>> +
>> +     down_write(&root->fs_info->subvol_sem);
>> +
>> +     if (btrfs_ino(inode) != BTRFS_FIRST_FREE_OBJECTID) {
>> +             ret = -EINVAL;
>> +             goto out;
>> +     }
>> +
>> +     if (btrfs_root_readonly(root)) {
>> +             ret = -EROFS;
>> +             goto out;
>> +     }
>> +
>> +     if (!inode_owner_or_capable(inode)) {
>> +             ret = -EACCES;
>> +             goto out;
>> +     }
>> +
>> +     sa = memdup_user(arg, sizeof(*sa));
>> +     if (IS_ERR(sa)) {
>> +             ret = PTR_ERR(sa);
>> +             sa = NULL;
>> +             goto out;
>> +     }
>> +
>> +     trans = btrfs_start_transaction(root, 1);
>> +     if (IS_ERR(trans)) {
>> +             ret = PTR_ERR(trans);
>> +             trans = NULL;
>> +             goto out;
>> +     }
>> +
>> +     sa->rtransid = trans->transid;
>> +     sa->rtime = CURRENT_TIME;
>> +
>> +     memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE);
>> +     btrfs_set_root_stransid(root_item, sa->stransid);
>> +     btrfs_set_root_rtransid(root_item, sa->rtransid);
>> +     root_item->stime.sec = cpu_to_le64(sa->stime.tv_sec);
>> +     root_item->stime.nsec = cpu_to_le64(sa->stime.tv_nsec);
>> +     root_item->rtime.sec = cpu_to_le64(sa->rtime.tv_sec);
>> +     root_item->rtime.nsec = cpu_to_le64(sa->rtime.tv_nsec);
>> +
>> +     ret = btrfs_update_root(trans, root->fs_info->tree_root,
>> +                             &root->root_key, &root->root_item);
>> +     if (ret < 0) {
>> +             goto out;
>
> are you leaking a trans handle here?
>
btrfs_update_root is aborting the transaction in case of failure. Do I
still need to call end_transaction?
>> +     } else {
>> +             ret = btrfs_commit_transaction(trans, root);
>> +             if (ret < 0)
>> +                     goto out;
>> +     }
>> +
>> +     ret = copy_to_user(arg, sa, sizeof(*sa));
>> +     if (ret)
>> +             ret = -EFAULT;
>> +
>> +out:
>> +     kfree(sa);
>> +     up_write(&root->fs_info->subvol_sem);
>> +     mnt_drop_write_file(file);
>> +     return ret;
>> +}
>> +
>>  long btrfs_ioctl(struct file *file, unsigned int
>>               cmd, unsigned long arg)
>>  {
>> @@ -3477,6 +3565,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>>               return btrfs_ioctl_balance_ctl(root, arg);
>>       case BTRFS_IOC_BALANCE_PROGRESS:
>>               return btrfs_ioctl_balance_progress(root, argp);
>> +     case BTRFS_IOC_SET_RECEIVED_SUBVOL:
>> +             return btrfs_ioctl_set_received_subvol(file, argp);
>>       case BTRFS_IOC_GET_DEV_STATS:
>>               return btrfs_ioctl_get_dev_stats(root, argp, 0);
>>       case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
>> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
>> index e440aa6..c9e3fac 100644
>> --- a/fs/btrfs/ioctl.h
>> +++ b/fs/btrfs/ioctl.h
>> @@ -295,6 +295,15 @@ struct btrfs_ioctl_get_dev_stats {
>>       __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
>>  };
>>
>> +struct btrfs_ioctl_received_subvol_args {
>> +     char    uuid[BTRFS_UUID_SIZE];  /* in */
>> +     __u64   stransid;               /* in */
>> +     __u64   rtransid;               /* out */
>> +     struct timespec stime;          /* in */
>> +     struct timespec rtime;          /* out */
>> +     __u64   reserved[16];
>
> What is this reserved used for? I don't see a mechanism that could be
> used to signal that there are useful information here, other than
> using a different ioctl.
>
The reserved space is the result of a suggestion from David. I can
remove it again if you want...
>> +};
>> +
>>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>>                                  struct btrfs_ioctl_vol_args)
>>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
>> @@ -359,6 +368,10 @@ struct btrfs_ioctl_get_dev_stats {
>>                                       struct btrfs_ioctl_ino_path_args)
>>  #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
>>                                       struct btrfs_ioctl_ino_path_args)
>> +
>> +#define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>> +                             struct btrfs_ioctl_received_subvol_args)
>> +
>>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>                                     struct btrfs_ioctl_get_dev_stats)
>>  #define BTRFS_IOC_GET_AND_RESET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>> index 24fb8ce..17d638e 100644
>> --- a/fs/btrfs/root-tree.c
>> +++ b/fs/btrfs/root-tree.c
>> @@ -16,6 +16,7 @@
>>   * Boston, MA 021110-1307, USA.
>>   */
>>
>> +#include <linux/uuid.h>
>>  #include "ctree.h"
>>  #include "transaction.h"
>>  #include "disk-io.h"
>> @@ -25,6 +26,9 @@
>>   * lookup the root with the highest offset for a given objectid.  The key we do
>>   * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
>>   * on error.
>> + * We also check if the root was once mounted with an older kernel. If we detect
>> + * this, the new fields coming after 'level' get overwritten with zeros so to
>> + * invalidate the fields.
>
> ... "This is detected by a mismatch of the 2 generation fields" ... or something
> like that.
>
The current version (found in git only) has this new function which is
called in find_last_root:
void btrfs_read_root_item(struct btrfs_root *root,
			 struct extent_buffer *eb, int slot,
			 struct btrfs_root_item *item)

The comment above this function explains what happens.

>>   */
>>  int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>>                       struct btrfs_root_item *item, struct btrfs_key *key)
>> @@ -35,6 +39,9 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>>       struct extent_buffer *l;
>>       int ret;
>>       int slot;
>> +     int len;
>> +     int need_reset = 0;
>> +     uuid_le uuid;
>>
>>       search_key.objectid = objectid;
>>       search_key.type = BTRFS_ROOT_ITEM_KEY;
>> @@ -60,11 +67,36 @@ int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
>>               ret = 1;
>>               goto out;
>>       }
>> -     if (item)
>> +     if (item) {
>> +             len = btrfs_item_size_nr(l, slot);
>>               read_extent_buffer(l, item, btrfs_item_ptr_offset(l, slot),
>> -                                sizeof(*item));
>> +                             min_t(int, len, (int)sizeof(*item)));
>> +             if (len < sizeof(*item))
>> +                     need_reset = 1;
>> +             if (!need_reset && btrfs_root_generation(item)
>> +                     != btrfs_root_generation_v2(item)) {
>> +                     if (btrfs_root_generation_v2(item) != 0) {
>> +                             printk(KERN_WARNING "btrfs: mismatching "
>> +                                             "generation and generation_v2 "
>> +                                             "found in root item. This root "
>> +                                             "was probably mounted with an "
>> +                                             "older kernel. Resetting all "
>> +                                             "new fields.\n");
>> +                     }
>> +                     need_reset = 1;
>> +             }
>> +             if (need_reset) {
>> +                     memset(&item->generation_v2, 0,
>> +                             sizeof(*item) - offsetof(struct btrfs_root_item,
>> +                                             generation_v2));
>> +
>> +                     uuid_le_gen(&uuid);
>> +                     memcpy(item->uuid, uuid.b, BTRFS_UUID_SIZE);
>> +             }
>> +     }
>>       if (key)
>>               memcpy(key, &found_key, sizeof(found_key));
>> +
>>       ret = 0;
>>  out:
>>       btrfs_free_path(path);
>> @@ -91,16 +123,15 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
>>       int ret;
>>       int slot;
>>       unsigned long ptr;
>> +     int old_len;
>>
>>       path = btrfs_alloc_path();
>>       if (!path)
>>               return -ENOMEM;
>>
>>       ret = btrfs_search_slot(trans, root, key, path, 0, 1);
>> -     if (ret < 0) {
>> -             btrfs_abort_transaction(trans, root, ret);
>> -             goto out;
>> -     }
>> +     if (ret < 0)
>> +             goto out_abort;
>>
>>       if (ret != 0) {
>>               btrfs_print_leaf(root, path->nodes[0]);
>> @@ -113,11 +144,47 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
>>       l = path->nodes[0];
>>       slot = path->slots[0];
>>       ptr = btrfs_item_ptr_offset(l, slot);
>> +     old_len = btrfs_item_size_nr(l, slot);
>> +
>> +     /*
>> +      * If this is the first time we update the root item which originated
>> +      * from an older kernel, we need to enlarge the item size to make room
>> +      * for the added fields.
>> +      */
>> +     if (old_len < sizeof(*item)) {
>> +             btrfs_release_path(path);
>> +             ret = btrfs_search_slot(trans, root, key, path,
>> +                             -1, 1);
>> +             if (ret < 0)
>> +                     goto out_abort;
>> +             ret = btrfs_del_item(trans, root, path);
>> +             if (ret < 0)
>> +                     goto out_abort;
>> +             btrfs_release_path(path);
>> +             ret = btrfs_insert_empty_item(trans, root, path,
>> +                             key, sizeof(*item));
>> +             if (ret < 0)
>> +                     goto out_abort;
>> +             l = path->nodes[0];
>> +             slot = path->slots[0];
>> +             ptr = btrfs_item_ptr_offset(l, slot);
>> +     }
>> +
>> +     /*
>> +      * Update generation_v2 so at the next mount we know the new root
>> +      * fields are valid.
>> +      */
>> +     btrfs_set_root_generation_v2(item, btrfs_root_generation(item));
>> +
>>       write_extent_buffer(l, item, ptr, sizeof(*item));
>>       btrfs_mark_buffer_dirty(path->nodes[0]);
>>  out:
>>       btrfs_free_path(path);
>>       return ret;
>> +
>> +out_abort:
>> +     btrfs_abort_transaction(trans, root, ret);
>> +     goto out;
>>  }
>>
>>  int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>> @@ -454,3 +521,16 @@ void btrfs_check_and_init_root_item(struct btrfs_root_item *root_item)
>>               root_item->byte_limit = 0;
>>       }
>>  }
>> +
>> +void btrfs_update_root_times(struct btrfs_trans_handle *trans,
>> +                          struct btrfs_root *root)
>> +{
>> +     struct btrfs_root_item *item = &root->root_item;
>> +     struct timespec ct = CURRENT_TIME;
>> +
>> +     spin_lock(&root->root_times_lock);
>> +     item->ctransid = trans->transid;
>> +     item->ctime.sec = cpu_to_le64(ct.tv_sec);
>> +     item->ctime.nsec = cpu_to_le64(ct.tv_nsec);
>> +     spin_unlock(&root->root_times_lock);
>> +}
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index b72b068..a21f308 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/writeback.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/blkdev.h>
>> +#include <linux/uuid.h>
>>  #include "ctree.h"
>>  #include "disk-io.h"
>>  #include "transaction.h"
>> @@ -926,11 +927,13 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>       struct dentry *dentry;
>>       struct extent_buffer *tmp;
>>       struct extent_buffer *old;
>> +     struct timespec cur_time = CURRENT_TIME;
>>       int ret;
>>       u64 to_reserve = 0;
>>       u64 index = 0;
>>       u64 objectid;
>>       u64 root_flags;
>> +     uuid_le new_uuid;
>>
>>       rsv = trans->block_rsv;
>>
>> @@ -1016,6 +1019,20 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>               root_flags &= ~BTRFS_ROOT_SUBVOL_RDONLY;
>>       btrfs_set_root_flags(new_root_item, root_flags);
>>
>> +     btrfs_set_root_generation_v2(new_root_item,
>> +                     trans->transid);
>> +     uuid_le_gen(&new_uuid);
>> +     memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
>> +     memcpy(new_root_item->parent_uuid, root->root_item.uuid,
>> +                     BTRFS_UUID_SIZE);
>> +     new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
>> +     new_root_item->otime.nsec = cpu_to_le64(cur_time.tv_nsec);
>> +     btrfs_set_root_otransid(new_root_item, trans->transid);
>> +     memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
>> +     memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
>> +     btrfs_set_root_stransid(new_root_item, 0);
>> +     btrfs_set_root_rtransid(new_root_item, 0);
>> +
>>       old = btrfs_lock_root_node(root);
>>       ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
>>       if (ret) {
>


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-23 19:41     ` Alexander Block
@ 2012-07-24  5:55       ` Arne Jansen
  2012-07-25 10:51         ` Alexander Block
  0 siblings, 1 reply; 43+ messages in thread
From: Arne Jansen @ 2012-07-24  5:55 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

On 23.07.2012 21:41, Alexander Block wrote:
> On Mon, Jul 16, 2012 at 4:56 PM, Arne Jansen <sensille@gmx.net> wrote:
>> On 04.07.2012 15:38, Alexander Block wrote:
>>> +
>>> +     ret = btrfs_update_root(trans, root->fs_info->tree_root,
>>> +                             &root->root_key, &root->root_item);
>>> +     if (ret < 0) {
>>> +             goto out;
>>
>> are you leaking a trans handle here?
>>
> btrfs_update_root is aborting the transaction in case of failure. Do I
> still need to call end_transaction?

It's your handle, you should free it.

...

>>>
>>> +struct btrfs_ioctl_received_subvol_args {
>>> +     char    uuid[BTRFS_UUID_SIZE];  /* in */
>>> +     __u64   stransid;               /* in */
>>> +     __u64   rtransid;               /* out */
>>> +     struct timespec stime;          /* in */
>>> +     struct timespec rtime;          /* out */
>>> +     __u64   reserved[16];
>>
>> What is this reserved used for? I don't see a mechanism that could be
>> used to signal that there are useful information here, other than
>> using a different ioctl.
>>
> The reserved is a result of a suggestion made by David. I can remove
> it again if you want...

I don't argue against some reserved space; I just have trouble seeing
how you can make use of it in the future when there is no way to
signal that it contains valid information. It should be sufficient to
define the reserved values to be 0 for the moment.

>>> +};
>>> +
>>>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>>>                                  struct btrfs_ioctl_vol_args)
>>>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
>>> @@ -359,6 +368,10 @@ struct btrfs_ioctl_get_dev_stats {
>>>                                       struct btrfs_ioctl_ino_path_args)
>>>  #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
>>>                                       struct btrfs_ioctl_ino_path_args)
>>> +
>>> +#define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>>> +                             struct btrfs_ioctl_received_subvol_args)
>>> +
>>>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>>                                     struct btrfs_ioctl_get_dev_stats)
>>>  #define BTRFS_IOC_GET_AND_RESET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>>> index 24fb8ce..17d638e 100644
>>> --- a/fs/btrfs/root-tree.c
>>> +++ b/fs/btrfs/root-tree.c
>>> @@ -16,6 +16,7 @@
>>>   * Boston, MA 021110-1307, USA.
>>>   */
>>>
>>> +#include <linux/uuid.h>
>>>  #include "ctree.h"
>>>  #include "transaction.h"
>>>  #include "disk-io.h"
>>> @@ -25,6 +26,9 @@
>>>   * lookup the root with the highest offset for a given objectid.  The key we do
>>>   * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
>>>   * on error.
>>> + * We also check if the root was once mounted with an older kernel. If we detect
>>> + * this, the new fields coming after 'level' get overwritten with zeros so to
>>> + * invalidate the fields.
>>
>> ... "This is detected by a mismatch of the 2 generation fields" ... or something
>> like that.
>>
> The current version (found in git only) has this new function which is
> called in find_last_root:
> void btrfs_read_root_item(struct btrfs_root *root,
> 			 struct extent_buffer *eb, int slot,
> 			 struct btrfs_root_item *item)
> 
> The comment above this function explains what happens.

ok. Please regard most of my comments as an expression of my thoughts
while reading the code. They mark places where it might be useful to
add comments to make it easier for the next reader :)


* Re: [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times
  2012-07-24  5:55       ` Arne Jansen
@ 2012-07-25 10:51         ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-25 10:51 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs

On Tue, Jul 24, 2012 at 7:55 AM, Arne Jansen <sensille@gmx.net> wrote:
> On 23.07.2012 21:41, Alexander Block wrote:
>> On Mon, Jul 16, 2012 at 4:56 PM, Arne Jansen <sensille@gmx.net> wrote:
>>> On 04.07.2012 15:38, Alexander Block wrote:
>>>> +
>>>> +     ret = btrfs_update_root(trans, root->fs_info->tree_root,
>>>> +                             &root->root_key, &root->root_item);
>>>> +     if (ret < 0) {
>>>> +             goto out;
>>>
>>> are you leaking a trans handle here?
>>>
>> btrfs_update_root is aborting the transaction in case of failure. Do I
>> still need to call end_transaction?
>
> It's your handle, you should free it.
Ahh...I thought abort_transaction already frees the handle. Fixed.
>
> ...
>
>>>>
>>>> +struct btrfs_ioctl_received_subvol_args {
>>>> +     char    uuid[BTRFS_UUID_SIZE];  /* in */
>>>> +     __u64   stransid;               /* in */
>>>> +     __u64   rtransid;               /* out */
>>>> +     struct timespec stime;          /* in */
>>>> +     struct timespec rtime;          /* out */
>>>> +     __u64   reserved[16];
>>>
>>> What is this reserved used for? I don't see a mechanism that could be
>>> used to signal that there are useful information here, other than
>>> using a different ioctl.
>>>
>> The reserved is a result of a suggestion made by David. I can remove
>> it again if you want...
>
> I don't argue against some reserved space, I only have problems to
> see how you can make use of them in the future when there's no way
> to signal that they contain valid information. I should be sufficient
> to define the reserved values to be 0 at the moment.
Misunderstood that. Now I see the problem :) I've added a flags field
before the reserved fields. It's unused for now, but it can later be
used to signal that new fields are in use.
>
>>>> +};
>>>> +
>>>>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>>>>                                  struct btrfs_ioctl_vol_args)
>>>>  #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
>>>> @@ -359,6 +368,10 @@ struct btrfs_ioctl_get_dev_stats {
>>>>                                       struct btrfs_ioctl_ino_path_args)
>>>>  #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
>>>>                                       struct btrfs_ioctl_ino_path_args)
>>>> +
>>>> +#define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>>>> +                             struct btrfs_ioctl_received_subvol_args)
>>>> +
>>>>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>>>                                     struct btrfs_ioctl_get_dev_stats)
>>>>  #define BTRFS_IOC_GET_AND_RESET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>>>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>>>> index 24fb8ce..17d638e 100644
>>>> --- a/fs/btrfs/root-tree.c
>>>> +++ b/fs/btrfs/root-tree.c
>>>> @@ -16,6 +16,7 @@
>>>>   * Boston, MA 021110-1307, USA.
>>>>   */
>>>>
>>>> +#include <linux/uuid.h>
>>>>  #include "ctree.h"
>>>>  #include "transaction.h"
>>>>  #include "disk-io.h"
>>>> @@ -25,6 +26,9 @@
>>>>   * lookup the root with the highest offset for a given objectid.  The key we do
>>>>   * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
>>>>   * on error.
>>>> + * We also check if the root was once mounted with an older kernel. If we detect
>>>> + * this, the new fields coming after 'level' get overwritten with zeros so to
>>>> + * invalidate the fields.
>>>
>>> ... "This is detected by a mismatch of the 2 generation fields" ... or something
>>> like that.
>>>
>> The current version (found in git only) has this new function which is
>> called in find_last_root:
>> void btrfs_read_root_item(struct btrfs_root *root,
>>                        struct extent_buffer *eb, int slot,
>>                        struct btrfs_root_item *item)
>>
>> The comment above this function explains what happens.
>
> ok. Please regard most of my comments as an expression of my thoughts while
> reading it. So they mark places where it might be useful to add comments
> to make it easier for the next reader :)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-10 15:26   ` Alex Lyakas
@ 2012-07-25 13:37     ` Alexander Block
  2012-07-25 17:20       ` Alex Lyakas
  0 siblings, 1 reply; 43+ messages in thread
From: Alexander Block @ 2012-07-25 13:37 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Tue, Jul 10, 2012 at 5:26 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Alexander,
> this focuses on area of sending file extents:
>
>> +static int is_extent_unchanged(struct send_ctx *sctx,
>> +                              struct btrfs_path *left_path,
>> +                              struct btrfs_key *ekey)
>> +{
>> +       int ret = 0;
>> +       struct btrfs_key key;
>> +       struct btrfs_path *path = NULL;
>> +       struct extent_buffer *eb;
>> +       int slot;
>> +       struct btrfs_key found_key;
>> +       struct btrfs_file_extent_item *ei;
>> +       u64 left_disknr;
>> +       u64 right_disknr;
>> +       u64 left_offset;
>> +       u64 right_offset;
>> +       u64 left_len;
>> +       u64 right_len;
>> +       u8 left_type;
>> +       u8 right_type;
>> +
>> +       path = alloc_path_for_send();
>> +       if (!path)
>> +               return -ENOMEM;
>> +
>> +       eb = left_path->nodes[0];
>> +       slot = left_path->slots[0];
>> +
>> +       ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>> +       left_type = btrfs_file_extent_type(eb, ei);
>> +       left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
>> +       left_len = btrfs_file_extent_num_bytes(eb, ei);
>> +       left_offset = btrfs_file_extent_offset(eb, ei);
>> +
>> +       if (left_type != BTRFS_FILE_EXTENT_REG) {
>> +               ret = 0;
>> +               goto out;
>> +       }
>> +
>> +       key.objectid = ekey->objectid;
>> +       key.type = BTRFS_EXTENT_DATA_KEY;
>> +       key.offset = ekey->offset;
>> +
>> +       while (1) {
>> +               ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path,
>> +                               0, 0);
>> +               if (ret < 0)
>> +                       goto out;
>> +               if (ret) {
>> +                       ret = 0;
>> +                       goto out;
>> +               }
>> +               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
>> +                               path->slots[0]);
>> +               if (found_key.objectid != key.objectid ||
>> +                   found_key.type != key.type) {
>> +                       ret = 0;
>> +                       goto out;
>> +               }
>> +
>> +               eb = path->nodes[0];
>> +               slot = path->slots[0];
>> +
>> +               ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>> +               right_type = btrfs_file_extent_type(eb, ei);
>> +               right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
>> +               right_len = btrfs_file_extent_num_bytes(eb, ei);
>> +               right_offset = btrfs_file_extent_offset(eb, ei);
>> +               btrfs_release_path(path);
>> +
>> +               if (right_type != BTRFS_FILE_EXTENT_REG) {
>> +                       ret = 0;
>> +                       goto out;
>> +               }
>> +
>> +               if (left_disknr != right_disknr) {
>> +                       ret = 0;
>> +                       goto out;
>> +               }
>> +
>> +               key.offset = found_key.offset + right_len;
>> +               if (key.offset >= ekey->offset + left_len) {
>> +                       ret = 1;
>> +                       goto out;
>> +               }
>> +       }
>> +
>> +out:
>> +       btrfs_free_path(path);
>> +       return ret;
>> +}
>> +
>
> Should we always treat left extent with bytenr==0 as not changed?
No, as we may have bytenr!=0 on the right side.
> Because right now, it simply reads and sends data of such extent,
> while bytenr==0 means "no data allocated here". Since we always do
> send_truncate() afterwards, file size will always be correct, so we
> can just skip bytenr==0 extents.
This is something that could be done for full sends only. Full sends
however do not call is_extent_unchanged, so the optimization belongs in
process_extent.
In the incremental case, it may happen that left_disknr==0 and
right_disknr!=0 or vice versa, so we need to do the compare no matter
if one of them is ==0. process_extent could then again do some
optimization and send a special command instructing the receiver to
create a preallocated zero block.
> Same is true for BTRFS_FILE_EXTENT_PREALLOC extents, I think. Those
> also don't contain real data.
> So something like:
> if (left_disknr == 0 || left_type == BTRFS_FILE_EXTENT_REG) {
>         ret = 1;
>         goto out;
> }
Do you mean "|| left_type == BTRFS_FILE_EXTENT_PREALLOC"?
> before we check for BTRFS_FILE_EXTENT_REG.
>
> Now I have a question about the rest of the logic that decides that
> extent is unchanged. I understand that if we see the same extent (same
> disk_bytenr) shared between parent_root and send_root, then it must
> contain the same data, even in nodatacow mode, because on a first
> write to such shared extent, it is cow'ed even with nodatacow.
>
> However, shouldn't we check btrfs_file_extent_offset(), to make sure
> that both send_root and parent_root point at the same offset into
> extent from the same file offset? Because if extent_offset values are
> different, then the data of the file might different, even though we
> are talking about the same extent.
>
> So I am thinking about something like:
>
> - ekey.offset points at data at logical address
> left_disknr+left_offset (logical address within CHUNK_ITEM address
> space) for left_len bytes
> - found_key.offset points at data at logical address
> right_disknr+right_offset for right_len
> - we know that found_key.offset <= ekey.offset
>
> So we need to ensure that left_disknr==right_disknr and also:
> right_disknr+right_offset + (ekey.offset - found_key.offset) ==
> left_disknr+left_offset
> or does this while loop somehow ensures this equation?
Ay...you're absolutely right :) Fixed that and pushing later today.
>
> However, I must admit I don't fully understand the logic behind
> deciding that extent is unchanged. Can you pls explain what this tries
> to accomplish, and why it decides that extent is unchanged here:
> key.offset = found_key.offset + right_len;
This line is to advance to the next extent on the right side.
> if (key.offset >= ekey->offset + left_len) {
>         ret = 1;
>         goto out;
> }
This if checks whether advancing would put us past the end of the left extent.
If that is true, we're done with the extent that we're checking now.
As we did not bail out before, we know that the extent is unchanged.
>
> Also: when searching for the next extent, should we use
> btrfs_file_extent_num_bytes() or btrfs_file_extent_disk_num_bytes()?
> They are not equal sometimes...not sure at which offset the next
> extent (if any) should be. What about holes in files? Then we will
> have non-consecutive offsets.
We have to use num_bytes, as it is the *uncompressed* number of bytes.
We're working on file extents, and their offsets are always
uncompressed. Also, num_bytes is the number of bytes after splitting,
while disk_num_bytes is always the size of the whole extent on disk.

I have changed the way I iterate the extents now. I use
btrfs_next_item instead of advancing key.offset now. Also, I have
added some ASCII graphics to illustrate what happens. I hope that
helps understanding this. Will push that later today.
>
> Thanks,
> Alex.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-25 13:37     ` Alexander Block
@ 2012-07-25 17:20       ` Alex Lyakas
  2012-07-25 17:41         ` Alexander Block
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Lyakas @ 2012-07-25 17:20 UTC (permalink / raw)
  To: Alexander Block; +Cc: linux-btrfs

Alexander,

>> Same is true for BTRFS_FILE_EXTENT_PREALLOC extents, I think. Those
>> also don't contain real data.
>> So something like:
>> if (left_disknr == 0 || left_type == BTRFS_FILE_EXTENT_REG) {
>>         ret = 1;
>>         goto out;
>> }
> Do you mean "|| left_type == BTRFS_FILE_EXTENT_PREALLOC"?

I see your point about bytenr==0, I missed that on the parent tree it
can be something else.

As for PREALLOC: can it happen that on differential send we see extent
of type BTRFS_FILE_EXTENT_PREALLOC? And can it happen that parent had
some real data extent in that place? I don't know the answer, but if
yes, then we must treat PREALLOC as normal extent. So this case is
similar to bytenr==0.

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1)
  2012-07-18  6:59   ` Arne Jansen
@ 2012-07-25 17:33     ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-25 17:33 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs

Thanks for the review :)

On 07/18/2012 08:59 AM, Arne Jansen wrote:
> On 04.07.2012 15:38, Alexander Block wrote:
>> This patch introduces the BTRFS_IOC_SEND ioctl that is
>> required for send. It allows btrfs-progs to implement
>> full and incremental sends. Patches for btrfs-progs will
>> follow.
>>
>> I had to split the patch as it got larger than 100k which is
>> the limit for the mailing list. The first part only contains
>> the send.h header and the helper functions for TLV handling
>> and long path name handling and some other helpers. The second
>> part contains the actual send logic from send.c
>>
>> Signed-off-by: Alexander Block <ablock84@googlemail.com>
>> ---
>>   fs/btrfs/Makefile |    2 +-
>>   fs/btrfs/ioctl.h  |   10 +
>>   fs/btrfs/send.c   | 1009 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/send.h   |  126 +++++++
>>   4 files changed, 1146 insertions(+), 1 deletion(-)
>>   create mode 100644 fs/btrfs/send.c
>>   create mode 100644 fs/btrfs/send.h
>>
>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>> index 0c4fa2b..f740644 100644
>> --- a/fs/btrfs/Makefile
>> +++ b/fs/btrfs/Makefile
>> @@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>   	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
>>   	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
>>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>> -	   reada.o backref.o ulist.o
>> +	   reada.o backref.o ulist.o send.o
>>
>>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
>> index c9e3fac..282bc64 100644
>> --- a/fs/btrfs/ioctl.h
>> +++ b/fs/btrfs/ioctl.h
>> @@ -304,6 +304,15 @@ struct btrfs_ioctl_received_subvol_args {
>>   	__u64	reserved[16];
>>   };
>>
>> +struct btrfs_ioctl_send_args {
>> +	__s64 send_fd;			/* in */
>> +	__u64 clone_sources_count;	/* in */
>> +	__u64 __user *clone_sources;	/* in */
>> +	__u64 parent_root;		/* in */
>> +	__u64 flags;			/* in */
>> +	__u64 reserved[4];		/* in */
>> +};
>> +
>>   #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
>>   				   struct btrfs_ioctl_vol_args)
>>   #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
>> @@ -371,6 +380,7 @@ struct btrfs_ioctl_received_subvol_args {
>>
>>   #define BTRFS_IOC_SET_RECEIVED_SUBVOL _IOWR(BTRFS_IOCTL_MAGIC, 37, \
>>   				struct btrfs_ioctl_received_subvol_args)
>> +#define BTRFS_IOC_SEND _IOW(BTRFS_IOCTL_MAGIC, 38, struct btrfs_ioctl_send_args)
>>
>>   #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>>   				      struct btrfs_ioctl_get_dev_stats)
>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>> new file mode 100644
>> index 0000000..47a2557
>> --- /dev/null
>> +++ b/fs/btrfs/send.c
>> @@ -0,0 +1,1009 @@
>> +/*
>> + * Copyright (C) 2012 Alexander Block.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; if not, write to the
>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>> + * Boston, MA 021110-1307, USA.
>> + */
>> +
>> +#include <linux/bsearch.h>
>> +#include <linux/fs.h>
>> +#include <linux/file.h>
>> +#include <linux/sort.h>
>> +#include <linux/mount.h>
>> +#include <linux/xattr.h>
>> +#include <linux/posix_acl_xattr.h>
>> +#include <linux/radix-tree.h>
>> +#include <linux/crc32c.h>
>> +
>> +#include "send.h"
>> +#include "backref.h"
>> +#include "locking.h"
>> +#include "disk-io.h"
>> +#include "btrfs_inode.h"
>> +#include "transaction.h"
>> +
>> +static int g_verbose = 0;
>> +
>> +#define verbose_printk(...) if (g_verbose) printk(__VA_ARGS__)
>
> Maybe pr_debug is interesting to you.
>
The advantage of this solution was that I could enable verbose output 
while I was in the debugger and single stepping. Not sure if I could do 
that with pr_debug. When send/receive gets stable and less debugging is 
needed, we can change this to pr_debug.
>> +
>> +/*
>> + * A fs_path is a helper to dynamically build path names with unknown size.
>> + * It reallocates the internal buffer on demand.
>> + * It allows fast adding of path elements on the right side (normal path) and
>> + * fast adding to the left side (reversed path). A reversed path can also be
>> + * unreversed if needed.
>> + */
>> +struct fs_path {
>> +	union {
>> +		struct {
>> +			char *start;
>> +			char *end;
>> +			char *prepared;
>> +
>> +			char *buf;
>> +			int buf_len;
>> +			int reversed:1;
>> +			int virtual_mem:1;
>
> s/int/unsigned int/
>
Changed for the bitfields but not for buf_len.
>> +			char inline_buf[];
>> +		};
>> +		char pad[PAGE_SIZE];
>> +	};
>> +};
>> +#define FS_PATH_INLINE_SIZE \
>> +	(sizeof(struct fs_path) - offsetof(struct fs_path, inline_buf))
>> +
>> +
>> +/* reused for each extent */
>> +struct clone_root {
>> +	struct btrfs_root *root;
>> +	u64 ino;
>> +	u64 offset;
>> +
>> +	u64 found_refs;
>> +};
>> +
>> +#define SEND_CTX_MAX_NAME_CACHE_SIZE 128
>> +#define SEND_CTX_NAME_CACHE_CLEAN_SIZE (SEND_CTX_MAX_NAME_CACHE_SIZE * 2)
>> +
>> +struct send_ctx {
>> +	struct file *send_filp;
>> +	loff_t send_off;
>> +	char *send_buf;
>> +	u32 send_size;
>> +	u32 send_max_size;
>> +	u64 total_send_size;
>> +	u64 cmd_send_size[BTRFS_SEND_C_MAX + 1];
>> +
>> +	struct vfsmount *mnt;
>> +
>> +	struct btrfs_root *send_root;
>> +	struct btrfs_root *parent_root;
>> +	struct clone_root *clone_roots;
>> +	int clone_roots_cnt;
>> +
>> +	/* current state of the compare_tree call */
>> +	struct btrfs_path *left_path;
>> +	struct btrfs_path *right_path;
>> +	struct btrfs_key *cmp_key;
>> +
>> +	/*
>> +	 * infos of the currently processed inode. In case of deleted inodes,
>> +	 * these are the values from the deleted inode.
>> +	 */
>> +	u64 cur_ino;
>> +	u64 cur_inode_gen;
>> +	int cur_inode_new;
>> +	int cur_inode_new_gen;
>> +	int cur_inode_deleted;
>> +	u64 cur_inode_size;
>> +	u64 cur_inode_mode;
>> +
>> +	u64 send_progress;
>> +
>> +	struct list_head new_refs;
>> +	struct list_head deleted_refs;
>> +
>> +	struct radix_tree_root name_cache;
>> +	struct list_head name_cache_list;
>> +	int name_cache_size;
>> +
>> +	struct file *cur_inode_filp;
>> +	char *read_buf;
>> +};
>> +
>> +struct name_cache_entry {
>> +	struct list_head list;
>> +	struct list_head use_list;
>> +	u64 ino;
>> +	u64 gen;
>> +	u64 parent_ino;
>> +	u64 parent_gen;
>> +	int ret;
>> +	int need_later_update;
>> +	int name_len;
>> +	char name[];
>> +};
>> +
>> +static void fs_path_reset(struct fs_path *p)
>> +{
>> +	if (p->reversed) {
>> +		p->start = p->buf + p->buf_len - 1;
>> +		p->end = p->start;
>> +		*p->start = 0;
>> +	} else {
>> +		p->start = p->buf;
>> +		p->end = p->start;
>> +		*p->start = 0;
>> +	}
>> +}
>> +
>> +static struct fs_path *fs_path_alloc(struct send_ctx *sctx)
>
> parameter unused.
>
The parameter is for later use as I planned to implement a caching 
mechanism for fs_path.
>> +{
>> +	struct fs_path *p;
>> +
>> +	p = kmalloc(sizeof(*p), GFP_NOFS);
>> +	if (!p)
>> +		return NULL;
>> +	p->reversed = 0;
>> +	p->virtual_mem = 0;
>> +	p->buf = p->inline_buf;
>> +	p->buf_len = FS_PATH_INLINE_SIZE;
>> +	fs_path_reset(p);
>> +	return p;
>> +}
>> +
>> +static struct fs_path *fs_path_alloc_reversed(struct send_ctx *sctx)
>
> ditto.
>
>> +{
>> +	struct fs_path *p;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return NULL;
>> +	p->reversed = 1;
>> +	fs_path_reset(p);
>> +	return p;
>> +}
>> +
>> +static void fs_path_free(struct send_ctx *sctx, struct fs_path *p)
>
> ditto, sctx unused.
>
>> +{
>> +	if (!p)
>> +		return;
>> +	if (p->buf != p->inline_buf) {
>> +		if (p->virtual_mem)
>> +			vfree(p->buf);
>> +		else
>> +			kfree(p->buf);
>> +	}
>> +	kfree(p);
>> +}
>> +
>> +static int fs_path_len(struct fs_path *p)
>> +{
>> +	return p->end - p->start;
>> +}
>> +
>> +static int fs_path_ensure_buf(struct fs_path *p, int len)
>> +{
>> +	char *tmp_buf;
>> +	int path_len;
>> +	int old_buf_len;
>> +
>> +	len++;
>
> This looks a bit unmotivated, what is it for? It might be clearer
> to add it to the calling site, if it has something to do with
> 0-termination.
>
There would be a lot of places where I would have to take care of 
0-termination so I decided to put it here.
>> +
>> +	if (p->buf_len >= len)
>> +		return 0;
>> +
>> +	path_len = p->end - p->start;
>> +	old_buf_len = p->buf_len;
>> +	len = PAGE_ALIGN(len);
>> +
>> +	if (p->buf == p->inline_buf) {
>> +		tmp_buf = kmalloc(len, GFP_NOFS);
>> +		if (!tmp_buf) {
>> +			tmp_buf = vmalloc(len);
>
> have you tested this path?
>
Yepp. I forced it to use vmalloc for testing.
>> +			if (!tmp_buf)
>> +				return -ENOMEM;
>> +			p->virtual_mem = 1;
>> +		}
>> +		memcpy(tmp_buf, p->buf, p->buf_len);
>> +		p->buf = tmp_buf;
>> +		p->buf_len = len;
>> +	} else {
>> +		if (p->virtual_mem) {
>> +			tmp_buf = vmalloc(len);
>> +			if (!tmp_buf)
>> +				return -ENOMEM;
>> +			memcpy(tmp_buf, p->buf, p->buf_len);
>> +			vfree(p->buf);
>> +		} else {
>> +			tmp_buf = krealloc(p->buf, len, GFP_NOFS);
>> +			if (!tmp_buf) {
>> +				tmp_buf = vmalloc(len);
>> +				if (!tmp_buf)
>> +					return -ENOMEM;
>> +				memcpy(tmp_buf, p->buf, p->buf_len);
>> +				kfree(p->buf);
>> +				p->virtual_mem = 1;
>> +			}
>> +		}
>> +		p->buf = tmp_buf;
>> +		p->buf_len = len;
>> +	}
>> +	if (p->reversed) {
>> +		tmp_buf = p->buf + old_buf_len - path_len - 1;
>> +		p->end = p->buf + p->buf_len - 1;
>> +		p->start = p->end - path_len;
>> +		memmove(p->start, tmp_buf, path_len + 1);
>
> First you copy it, then you move it again? There's room for optimization
> here ;)
>
Yeah...but for now I tried to avoid dozens of duplicated lines. 
Something that could be optimized later.
>> +	} else {
>> +		p->start = p->buf;
>> +		p->end = p->start + path_len;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int fs_path_prepare_for_add(struct fs_path *p, int name_len)
>> +{
>> +	int ret;
>> +	int new_len;
>> +
>> +	new_len = p->end - p->start + name_len;
>> +	if (p->start != p->end)
>> +		new_len++;
>> +	ret = fs_path_ensure_buf(p, new_len);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (p->reversed) {
>> +		if (p->start != p->end)
>> +			*--p->start = '/';
>> +		p->start -= name_len;
>> +		p->prepared = p->start;
>> +	} else {
>> +		if (p->start != p->end)
>> +			*p->end++ = '/';
>> +		p->prepared = p->end;
>> +		p->end += name_len;
>> +		*p->end = 0;
>> +	}
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int fs_path_add(struct fs_path *p, char *name, int name_len)
>> +{
>> +	int ret;
>> +
>> +	ret = fs_path_prepare_for_add(p, name_len);
>> +	if (ret < 0)
>> +		goto out;
>> +	memcpy(p->prepared, name, name_len);
>> +	p->prepared = NULL;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int fs_path_add_path(struct fs_path *p, struct fs_path *p2)
>> +{
>> +	int ret;
>> +
>> +	ret = fs_path_prepare_for_add(p, p2->end - p2->start);
>> +	if (ret < 0)
>> +		goto out;
>> +	memcpy(p->prepared, p2->start, p2->end - p2->start);
>> +	p->prepared = NULL;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int fs_path_add_from_extent_buffer(struct fs_path *p,
>> +					  struct extent_buffer *eb,
>> +					  unsigned long off, int len)
>> +{
>> +	int ret;
>> +
>> +	ret = fs_path_prepare_for_add(p, len);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	read_extent_buffer(eb, p->prepared, off, len);
>> +	p->prepared = NULL;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int fs_path_copy(struct fs_path *p, struct fs_path *from)
>> +{
>> +	int ret;
>> +
>> +	p->reversed = from->reversed;
>> +	fs_path_reset(p);
>> +
>> +	ret = fs_path_add_path(p, from);
>> +
>> +	return ret;
>> +}
>> +
>> +
>> +static void fs_path_unreverse(struct fs_path *p)
>> +{
>> +	char *tmp;
>> +	int len;
>> +
>> +	if (!p->reversed)
>> +		return;
>> +
>> +	tmp = p->start;
>> +	len = p->end - p->start;
>> +	p->start = p->buf;
>> +	p->end = p->start + len;
>> +	memmove(p->start, tmp, len + 1);
>> +	p->reversed = 0;
>
> oh, reversed doesn't mean that the path is reversed, but only that
> you are prepending components? Otherwise this function doesn't look
> like it would reverse anything. So maybe something with 'prepend' in
> it might be a better name than 'reverse' for all occurrences of
> 'reverse' above.
Right, the "reversed" path is for cases where prepending is done instead 
of appending, for example inode/ref to path resolution. Hmm...I'll call 
the flag in struct fs_path "for_prepend" and add a parameter to 
fs_path_alloc called "want_prepend". I'll also rename the unreverse 
function to "fs_path_want_prepend" which can be used to enable or 
disable prepending. I hope this is clearer.
>
>> +}
>> +
>> +static struct btrfs_path *alloc_path_for_send(void)
>> +{
>> +	struct btrfs_path *path;
>> +
>> +	path = btrfs_alloc_path();
>> +	if (!path)
>> +		return NULL;
>> +	path->search_commit_root = 1;
>> +	path->skip_locking = 1;
>> +	return path;
>> +}
>> +
>> +static int write_buf(struct send_ctx *sctx, const void *buf, u32 len)
>> +{
>> +	int ret;
>> +	mm_segment_t old_fs;
>> +	u32 pos = 0;
>> +
>> +	old_fs = get_fs();
>> +	set_fs(KERNEL_DS);
>> +
>> +	while (pos < len) {
>> +		ret = vfs_write(sctx->send_filp, (char *)buf + pos, len - pos,
>> +				&sctx->send_off);
>> +		/* TODO handle that correctly */
>> +		/*if (ret == -ERESTARTSYS) {
>> +			continue;
>> +		}*/
>
> I prefer #if 0 over comments to disable code, but I don't know what the
> styleguide has to say to that.
Oh oh...now I remember a TODO I had in my head: Handle ERESTARTSYS
correctly when calling vfs_write. This is a big issue at the moment, as
when this happens we currently bail out from the whole send code...the
ioctl call then gets restarted and we completely start from the
beginning, which is really bad. I probably need help here. Is there a
way to disable ERESTARTSYS temporarily? Any other possible solutions?
>
>> +		if (ret < 0) {
>> +			printk("%d\n", ret);
>
> This is not the most verbose error message of all :)
>
Whoops...removed.
>> +			goto out;
>> +		}
>> +		if (ret == 0) {
>> +			ret = -EIO;
>> +			goto out;
>> +		}
>> +		pos += ret;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +out:
>> +	set_fs(old_fs);
>> +	return ret;
>> +}
>> +
>> +static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len)
>> +{
>> +	struct btrfs_tlv_header *hdr;
>> +	int total_len = sizeof(*hdr) + len;
>> +	int left = sctx->send_max_size - sctx->send_size;
>> +
>> +	if (unlikely(left < total_len))
>> +		return -EOVERFLOW;
>> +
>> +	hdr = (struct btrfs_tlv_header *) (sctx->send_buf + sctx->send_size);
>> +	hdr->tlv_type = cpu_to_le16(attr);
>> +	hdr->tlv_len = cpu_to_le16(len);
>
> you might want to check for len overflow here
>
The if (unlikely(...)) check at the top is already handling this.
>> +	memcpy(hdr + 1, data, len);
>> +	sctx->send_size += total_len;
>> +
>> +	return 0;
>> +}
>> +
>> +#if 0
>> +static int tlv_put_u8(struct send_ctx *sctx, u16 attr, u8 value)
>> +{
>> +	return tlv_put(sctx, attr, &value, sizeof(value));
>> +}
>> +
>> +static int tlv_put_u16(struct send_ctx *sctx, u16 attr, u16 v)
>> +{
>> +	__le16 tmp = cpu_to_le16(value);
>
> s/value/v
>
I've done the opposite now.
>> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
>> +}
>> +
>> +static int tlv_put_u32(struct send_ctx *sctx, u16 attr, u32 value)
>> +{
>> +	__le32 tmp = cpu_to_le32(value);
>> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
>> +}
>> +#endif
>> +
>> +static int tlv_put_u64(struct send_ctx *sctx, u16 attr, u64 value)
>> +{
>> +	__le64 tmp = cpu_to_le64(value);
>> +	return tlv_put(sctx, attr, &tmp, sizeof(tmp));
>> +}
>> +
>> +static int tlv_put_string(struct send_ctx *sctx, u16 attr,
>> +			  const char *str, int len)
>> +{
>> +	if (len == -1)
>> +		len = strlen(str);
>> +	return tlv_put(sctx, attr, str, len);
>> +}
>> +
>> +static int tlv_put_uuid(struct send_ctx *sctx, u16 attr,
>> +			const u8 *uuid)
>> +{
>> +	return tlv_put(sctx, attr, uuid, BTRFS_UUID_SIZE);
>> +}
>> +
>> +#if 0
>> +static int tlv_put_timespec(struct send_ctx *sctx, u16 attr,
>> +			    struct timespec *ts)
>> +{
>> +	struct btrfs_timespec bts;
>> +	bts.sec = cpu_to_le64(ts->tv_sec);
>> +	bts.nsec = cpu_to_le32(ts->tv_nsec);
>> +	return tlv_put(sctx, attr, &bts, sizeof(bts));
>> +}
>> +#endif
>> +
>> +static int tlv_put_btrfs_timespec(struct send_ctx *sctx, u16 attr,
>> +				  struct extent_buffer *eb,
>> +				  struct btrfs_timespec *ts)
>> +{
>> +	struct btrfs_timespec bts;
>> +	read_extent_buffer(eb, &bts, (unsigned long)ts, sizeof(bts));
>> +	return tlv_put(sctx, attr, &bts, sizeof(bts));
>> +}
>> +
>> +
>> +#define TLV_PUT(sctx, attrtype, attrlen, data) \
>> +	do { \
>> +		ret = tlv_put(sctx, attrtype, attrlen, data); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +
>> +#define TLV_PUT_INT(sctx, attrtype, bits, value) \
>> +	do { \
>> +		ret = tlv_put_u##bits(sctx, attrtype, value); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +
>> +#define TLV_PUT_U8(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 8, data)
>> +#define TLV_PUT_U16(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 16, data)
>> +#define TLV_PUT_U32(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 32, data)
>> +#define TLV_PUT_U64(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 64, data)
>> +#define TLV_PUT_STRING(sctx, attrtype, str, len) \
>> +	do { \
>> +		ret = tlv_put_string(sctx, attrtype, str, len); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +#define TLV_PUT_PATH(sctx, attrtype, p) \
>> +	do { \
>> +		ret = tlv_put_string(sctx, attrtype, p->start, \
>> +			p->end - p->start); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +#define TLV_PUT_UUID(sctx, attrtype, uuid) \
>> +	do { \
>> +		ret = tlv_put_uuid(sctx, attrtype, uuid); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +#define TLV_PUT_TIMESPEC(sctx, attrtype, ts) \
>> +	do { \
>> +		ret = tlv_put_timespec(sctx, attrtype, ts); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +#define TLV_PUT_BTRFS_TIMESPEC(sctx, attrtype, eb, ts) \
>> +	do { \
>> +		ret = tlv_put_btrfs_timespec(sctx, attrtype, eb, ts); \
>> +		if (ret < 0) \
>> +			goto tlv_put_failure; \
>> +	} while (0)
>> +
>> +static int send_header(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_stream_header hdr;
>> +
>> +	strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC);
>> +	hdr.version = cpu_to_le32(BTRFS_SEND_STREAM_VERSION);
>> +
>> +	ret = write_buf(sctx, &hdr, sizeof(hdr));
>> +
>> +	return ret;
>
> (just return write_buf)
>
Done.
>> +}
>> +
>> +/*
>> + * For each command/item we want to send to userspace, we call this function.
>> + */
>> +static int begin_cmd(struct send_ctx *sctx, int cmd)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_cmd_header *hdr;
>> +
>> +	if (!sctx->send_buf) {
>> +		WARN_ON(1);
>> +		return -EINVAL;
>> +	}
>> +
>> +	BUG_ON(!sctx->send_buf);
>
> that's kind of redundant.
>
Removed.
>> +	BUG_ON(sctx->send_size);
>> +
>> +	sctx->send_size += sizeof(*hdr);
>> +	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
>> +	hdr->cmd = cpu_to_le16(cmd);
>> +
>> +	return ret;
>
> ret is untouched here
>
Removed ret.
>> +}
>> +
>> +static int send_cmd(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_cmd_header *hdr;
>> +	u32 crc;
>> +
>> +	hdr = (struct btrfs_cmd_header *)sctx->send_buf;
>> +	hdr->len = cpu_to_le32(sctx->send_size - sizeof(*hdr));
>> +	hdr->crc = 0;
>> +
>> +	crc = crc32c(0, (unsigned char *)sctx->send_buf, sctx->send_size);
>> +	hdr->crc = cpu_to_le32(crc);
>> +
>> +	ret = write_buf(sctx, sctx->send_buf, sctx->send_size);
>> +
>> +	sctx->total_send_size += sctx->send_size;
>> +	sctx->cmd_send_size[le16_to_cpu(hdr->cmd)] += sctx->send_size;
>> +	sctx->send_size = 0;
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends a move instruction to user space
>> + */
>> +static int send_rename(struct send_ctx *sctx,
>> +		     struct fs_path *from, struct fs_path *to)
>> +{
>> +	int ret;
>> +
>> +verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, from);
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_TO, to);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends a link instruction to user space
>> + */
>> +static int send_link(struct send_ctx *sctx,
>> +		     struct fs_path *path, struct fs_path *lnk)
>> +{
>> +	int ret;
>> +
>> +verbose_printk("btrfs: send_link %s ->  %s\n", path->start, lnk->start);
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, lnk);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends an unlink instruction to user space
>> + */
>> +static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
>> +{
>> +	int ret;
>> +
>> +verbose_printk("btrfs: send_unlink %s\n", path->start);
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends a rmdir instruction to user space
>> + */
>> +static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
>> +{
>> +	int ret;
>> +
>> +verbose_printk("btrfs: send_rmdir %s\n", path->start);
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Helper function to retrieve some fields from an inode item.
>> + */
>> +static int get_inode_info(struct btrfs_root *root,
>> +			  u64 ino, u64 *size, u64 *gen,
>> +			  u64 *mode, u64 *uid, u64 *gid)
>> +{
>> +	int ret;
>> +	struct btrfs_inode_item *ii;
>> +	struct btrfs_key key;
>> +	struct btrfs_path *path;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	key.objectid = ino;
>> +	key.type = BTRFS_INODE_ITEM_KEY;
>> +	key.offset = 0;
>> +	ret = btrfs_search_slot(NULL, root,&key, path, 0, 0);
>> +	if (ret<  0)
>> +		goto out;
>> +	if (ret) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	ii = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +			struct btrfs_inode_item);
>> +	if (size)
>> +		*size = btrfs_inode_size(path->nodes[0], ii);
>> +	if (gen)
>> +		*gen = btrfs_inode_generation(path->nodes[0], ii);
>> +	if (mode)
>> +		*mode = btrfs_inode_mode(path->nodes[0], ii);
>> +	if (uid)
>> +		*uid = btrfs_inode_uid(path->nodes[0], ii);
>> +	if (gid)
>> +		*gid = btrfs_inode_gid(path->nodes[0], ii);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +typedef int (*iterate_inode_ref_t)(int num, u64 dir, int index,
>> +				   struct fs_path *p,
>> +				   void *ctx);
>> +
>> +/*
>> + * Helper function to iterate the entries in ONE btrfs_inode_ref.
>> + * The iterate callback may return a non zero value to stop iteration. This can
>> + * be a negative value for error codes or 1 to simply stop it.
>> + *
>> + * path must point to the INODE_REF when called.
>> + */
>> +static int iterate_inode_ref(struct send_ctx *sctx,
>> +			     struct btrfs_root *root, struct btrfs_path *path,
>> +			     struct btrfs_key *found_key, int resolve,
>> +			     iterate_inode_ref_t iterate, void *ctx)
>> +{
>> +	struct extent_buffer *eb;
>> +	struct btrfs_item *item;
>> +	struct btrfs_inode_ref *iref;
>> +	struct btrfs_path *tmp_path;
>> +	struct fs_path *p;
>> +	u32 cur;
>> +	u32 len;
>> +	u32 total;
>> +	int slot;
>> +	u32 name_len;
>> +	char *start;
>> +	int ret = 0;
>> +	int num;
>> +	int index;
>> +
>> +	p = fs_path_alloc_reversed(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	tmp_path = alloc_path_for_send();
>> +	if (!tmp_path) {
>> +		fs_path_free(sctx, p);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	eb = path->nodes[0];
>> +	slot = path->slots[0];
>> +	item = btrfs_item_nr(eb, slot);
>> +	iref = btrfs_item_ptr(eb, slot, struct btrfs_inode_ref);
>> +	cur = 0;
>> +	len = 0;
>> +	total = btrfs_item_size(eb, item);
>> +
>> +	num = 0;
>> +	while (cur<  total) {
>> +		fs_path_reset(p);
>> +
>> +		name_len = btrfs_inode_ref_name_len(eb, iref);
>> +		index = btrfs_inode_ref_index(eb, iref);
>> +		if (resolve) {
>> +			start = btrfs_iref_to_path(root, tmp_path, iref, eb,
>> +						found_key->offset, p->buf,
>> +						p->buf_len);
>
> it might be worth it to build a better integration between
> iref_to_path and your fs_path data structure. Maybe iref_to_path
> can make direct use of fs_path.
>
That's something I thought about in the past but dropped it because I 
did not want to touch too many other parts of btrfs with my patches. I 
think this is something that should be done once the patches are 
upstream.
>> +			if (IS_ERR(start)) {
>> +				ret = PTR_ERR(start);
>> +				goto out;
>> +			}
>> +			if (start<  p->buf) {
>> +				/* overflow , try again with larger buffer */
>> +				ret = fs_path_ensure_buf(p,
>> +						p->buf_len + p->buf - start);
>> +				if (ret<  0)
>> +					goto out;
>> +				start = btrfs_iref_to_path(root, tmp_path, iref,
>> +						eb, found_key->offset, p->buf,
>> +						p->buf_len);
>> +				if (IS_ERR(start)) {
>> +					ret = PTR_ERR(start);
>> +					goto out;
>> +				}
>> +				BUG_ON(start<  p->buf);
>> +			}
>> +			p->start = start;
>> +		} else {
>> +			ret = fs_path_add_from_extent_buffer(p, eb,
>> +					(unsigned long)(iref + 1), name_len);
>> +			if (ret<  0)
>> +				goto out;
>> +		}
>> +
>> +
>> +		len = sizeof(*iref) + name_len;
>> +		iref = (struct btrfs_inode_ref *)((char *)iref + len);
>> +		cur += len;
>> +
>> +		ret = iterate(num, found_key->offset, index, p, ctx);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>
> wouldn't it make sense to pass this information on to the caller?
>
Yep, makes sense. Changed it to return the info. Also updated all 
callers to watch out for ret > 0.
>> +			goto out;
>> +		}
>> +
>> +		num++;
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(tmp_path);
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +typedef int (*iterate_dir_item_t)(int num, const char *name, int name_len,
>> +				  const char *data, int data_len,
>> +				  u8 type, void *ctx);
>> +
>> +/*
>> + * Helper function to iterate the entries in ONE btrfs_dir_item.
>> + * The iterate callback may return a non zero value to stop iteration. This can
>> + * be a negative value for error codes or 1 to simply stop it.
>> + *
>> + * path must point to the dir item when called.
>> + */
>> +static int iterate_dir_item(struct send_ctx *sctx,
>> +			    struct btrfs_root *root, struct btrfs_path *path,
>> +			    struct btrfs_key *found_key,
>> +			    iterate_dir_item_t iterate, void *ctx)
>> +{
>> +	int ret = 0;
>> +	struct extent_buffer *eb;
>> +	struct btrfs_item *item;
>> +	struct btrfs_dir_item *di;
>> +	struct btrfs_path *tmp_path = NULL;
>> +	char *buf = NULL;
>> +	char *buf2 = NULL;
>> +	int buf_len;
>> +	int buf_virtual = 0;
>> +	u32 name_len;
>> +	u32 data_len;
>> +	u32 cur;
>> +	u32 len;
>> +	u32 total;
>> +	int slot;
>> +	int num;
>> +	u8 type;
>> +
>> +	buf_len = PAGE_SIZE;
>> +	buf = kmalloc(buf_len, GFP_NOFS);
>> +	if (!buf) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	tmp_path = alloc_path_for_send();
>> +	if (!tmp_path) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	eb = path->nodes[0];
>> +	slot = path->slots[0];
>> +	item = btrfs_item_nr(eb, slot);
>> +	di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item);
>> +	cur = 0;
>> +	len = 0;
>> +	total = btrfs_item_size(eb, item);
>> +
>> +	num = 0;
>> +	while (cur<  total) {
>> +		name_len = btrfs_dir_name_len(eb, di);
>> +		data_len = btrfs_dir_data_len(eb, di);
>> +		type = btrfs_dir_type(eb, di);
>> +
>> +		if (name_len + data_len>  buf_len) {
>> +			buf_len = PAGE_ALIGN(name_len + data_len);
>> +			if (buf_virtual) {
>> +				buf2 = vmalloc(buf_len);
>> +				if (!buf2) {
>> +					ret = -ENOMEM;
>> +					goto out;
>> +				}
>> +				vfree(buf);
>> +			} else {
>> +				buf2 = krealloc(buf, buf_len, GFP_NOFS);
>> +				if (!buf2) {
>> +					buf2 = vmalloc(buf_len);
>> +					if (!buf) {
>
> !buf2
>
Fixed.
>> +						ret = -ENOMEM;
>> +						goto out;
>> +					}
>> +					kfree(buf);
>> +					buf_virtual = 1;
>> +				}
>> +			}
>> +
>> +			buf = buf2;
>> +			buf2 = NULL;
>> +		}
>> +
>> +		read_extent_buffer(eb, buf, (unsigned long)(di + 1),
>> +				name_len + data_len);
>> +
>> +		len = sizeof(*di) + name_len + data_len;
>> +		di = (struct btrfs_dir_item *)((char *)di + len);
>> +		cur += len;
>> +
>> +		ret = iterate(num, buf, name_len, buf + name_len, data_len,
>> +				type, ctx);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		num++;
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(tmp_path);
>> +	if (buf_virtual)
>> +		vfree(buf);
>> +	else
>> +		kfree(buf);
>> +	return ret;
>> +}
>> +
>> +static int __copy_first_ref(int num, u64 dir, int index,
>> +			    struct fs_path *p, void *ctx)
>> +{
>> +	int ret;
>> +	struct fs_path *pt = ctx;
>> +
>> +	ret = fs_path_copy(pt, p);
>> +	if (ret<  0)
>> +		return ret;
>> +
>> +	/* we want the first only */
>> +	return 1;
>> +}
>> +
>> +/*
>> + * Retrieve the first path of an inode. If an inode has more than one
>> + * ref/hardlink, this is ignored.
>> + */
>> +static int get_inode_path(struct send_ctx *sctx, struct btrfs_root *root,
>> +			  u64 ino, struct fs_path *path)
>> +{
>> +	int ret;
>> +	struct btrfs_key key, found_key;
>> +	struct btrfs_path *p;
>> +
>> +	p = alloc_path_for_send();
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	fs_path_reset(path);
>> +
>> +	key.objectid = ino;
>> +	key.type = BTRFS_INODE_REF_KEY;
>> +	key.offset = 0;
>> +
>> +	ret = btrfs_search_slot_for_read(root,&key, p, 1, 0);
>> +	if (ret<  0)
>> +		goto out;
>> +	if (ret) {
>> +		ret = 1;
>> +		goto out;
>> +	}
>> +	btrfs_item_key_to_cpu(p->nodes[0],&found_key, p->slots[0]);
>> +	if (found_key.objectid != ino ||
>> +		found_key.type != BTRFS_INODE_REF_KEY) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	ret = iterate_inode_ref(sctx, root, p,&found_key, 1,
>> +			__copy_first_ref, path);
>> +	if (ret<  0)
>> +		goto out;
>> +	ret = 0;
>> +
>> +out:
>> +	btrfs_free_path(p);
>> +	return ret;
>> +}
>> +
>> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
>> new file mode 100644
>> index 0000000..a4c23ee
>> --- /dev/null
>> +++ b/fs/btrfs/send.h
>> @@ -0,0 +1,126 @@
>> +/*
>> + * Copyright (C) 2012 Alexander Block.  All rights reserved.
>> + * Copyright (C) 2012 STRATO.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; if not, write to the
>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>> + * Boston, MA 021110-1307, USA.
>> + */
>> +
>> +#include "ctree.h"
>> +
>> +#define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
>> +#define BTRFS_SEND_STREAM_VERSION 1
>> +
>> +#define BTRFS_SEND_BUF_SIZE (1024 * 64)
>> +#define BTRFS_SEND_READ_SIZE (1024 * 48)
>> +
>> +enum btrfs_tlv_type {
>> +	BTRFS_TLV_U8,
>> +	BTRFS_TLV_U16,
>> +	BTRFS_TLV_U32,
>> +	BTRFS_TLV_U64,
>> +	BTRFS_TLV_BINARY,
>> +	BTRFS_TLV_STRING,
>> +	BTRFS_TLV_UUID,
>> +	BTRFS_TLV_TIMESPEC,
>> +};
>> +
>> +struct btrfs_stream_header {
>> +	char magic[sizeof(BTRFS_SEND_STREAM_MAGIC)];
>> +	__le32 version;
>> +} __attribute__ ((__packed__));
>> +
>> +struct btrfs_cmd_header {
>> +	__le32 len;
>> +	__le16 cmd;
>> +	__le32 crc;
>> +} __attribute__ ((__packed__));
>
> Please add some comments for this struct, e.g. that the len
> is the len excluding the header, and the crc is calculated
> over the full cmd including header with crc set to 0.
>
Added comments.
>> +
>> +struct btrfs_tlv_header {
>> +	__le16 tlv_type;
>> +	__le16 tlv_len;
>> +} __attribute__ ((__packed__));
>> +
>> +/* commands */
>> +enum btrfs_send_cmd {
>> +	BTRFS_SEND_C_UNSPEC,
>> +
>> +	BTRFS_SEND_C_SUBVOL,
>> +	BTRFS_SEND_C_SNAPSHOT,
>> +
>> +	BTRFS_SEND_C_MKFILE,
>> +	BTRFS_SEND_C_MKDIR,
>> +	BTRFS_SEND_C_MKNOD,
>> +	BTRFS_SEND_C_MKFIFO,
>> +	BTRFS_SEND_C_MKSOCK,
>> +	BTRFS_SEND_C_SYMLINK,
>> +
>> +	BTRFS_SEND_C_RENAME,
>> +	BTRFS_SEND_C_LINK,
>> +	BTRFS_SEND_C_UNLINK,
>> +	BTRFS_SEND_C_RMDIR,
>> +
>> +	BTRFS_SEND_C_SET_XATTR,
>> +	BTRFS_SEND_C_REMOVE_XATTR,
>> +
>> +	BTRFS_SEND_C_WRITE,
>> +	BTRFS_SEND_C_CLONE,
>> +
>> +	BTRFS_SEND_C_TRUNCATE,
>> +	BTRFS_SEND_C_CHMOD,
>> +	BTRFS_SEND_C_CHOWN,
>> +	BTRFS_SEND_C_UTIMES,
>> +
>> +	BTRFS_SEND_C_END,
>> +	__BTRFS_SEND_C_MAX,
>> +};
>> +#define BTRFS_SEND_C_MAX (__BTRFS_SEND_C_MAX - 1)
>> +
>> +/* attributes in send stream */
>> +enum {
>> +	BTRFS_SEND_A_UNSPEC,
>> +
>> +	BTRFS_SEND_A_UUID,
>> +	BTRFS_SEND_A_CTRANSID,
>> +
>> +	BTRFS_SEND_A_INO,
>> +	BTRFS_SEND_A_SIZE,
>> +	BTRFS_SEND_A_MODE,
>> +	BTRFS_SEND_A_UID,
>> +	BTRFS_SEND_A_GID,
>> +	BTRFS_SEND_A_RDEV,
>> +	BTRFS_SEND_A_CTIME,
>> +	BTRFS_SEND_A_MTIME,
>> +	BTRFS_SEND_A_ATIME,
>> +	BTRFS_SEND_A_OTIME,
>> +
>> +	BTRFS_SEND_A_XATTR_NAME,
>> +	BTRFS_SEND_A_XATTR_DATA,
>> +
>> +	BTRFS_SEND_A_PATH,
>> +	BTRFS_SEND_A_PATH_TO,
>> +	BTRFS_SEND_A_PATH_LINK,
>> +
>> +	BTRFS_SEND_A_FILE_OFFSET,
>> +	BTRFS_SEND_A_DATA,
>> +
>> +	BTRFS_SEND_A_CLONE_UUID,
>> +	BTRFS_SEND_A_CLONE_CTRANSID,
>> +	BTRFS_SEND_A_CLONE_PATH,
>> +	BTRFS_SEND_A_CLONE_OFFSET,
>> +	BTRFS_SEND_A_CLONE_LEN,
>> +
>> +	__BTRFS_SEND_A_MAX,
>> +};
>> +#define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-25 17:20       ` Alex Lyakas
@ 2012-07-25 17:41         ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-25 17:41 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Wed, Jul 25, 2012 at 7:20 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Alexander,
>
>>> Same is true for BTRFS_FILE_EXTENT_PREALLOC extents, I think. Those
>>> also don't contain real data.
>>> So something like:
>>> if (left_disknr == 0 || left_type == BTRFS_FILE_EXTENT_REG) {
>>>         ret = 1;
>>>         goto out;
>>> }
>> Do you mean "|| left_type == BTRFS_FILE_EXTENT_PREALLOC"?
>
> I see your point about bytenr==0, I missed that on the parent tree it
> can be something else.
>
> As for PREALLOC: can it happen that on differential send we see extent
> of type BTRFS_FILE_EXTENT_PREALLOC? And can it happen that parent had
> some real data extent in that place? I don't know the answer, but if
> yes, then we must treat PREALLOC as normal extent. So this case is
> similar to bytenr==0.
>
I also don't know if that may happen. Currently, only REG extents are
checked by is_extent_unchanged. All other types are regarded as
changed and will be sent. So in the worst case the stream gets larger
than it should be, but we won't lose data. I need to leave in a few
minutes and will continue working on btrfs send/receive v2 later
today. We should probably postpone "optimizations" (actually bug
fixing) here for later...don't know if I'll find enough time to
investigate more.

> Thanks,
> Alex.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-23 11:16   ` Arne Jansen
  2012-07-23 15:28     ` Alex Lyakas
@ 2012-07-28 13:49     ` Alexander Block
  1 sibling, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-07-28 13:49 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs

On 07/23/2012 01:16 PM, Arne Jansen wrote:
> This is a first review run. I ask for more comments in several places.
> Maybe these comments can help to dive deeper into a functional review
> in a second run.
> I'd really appreciate it if you could write a few pages about the
> concepts how you decide what to send and when.
> It seems there's still a lot of headroom for performance optimizations
> cpu/seek-wise.
I started to document stuff in
http://btrfs.wiki.kernel.org/index.php/Btrfs_Send/Receive
There is also a collection for optimizations that I have in mind for later.
> All in all I really like this work.
>
> On 04.07.2012 15:38, Alexander Block wrote:
>> This is the second part of the splitted BTRFS_IOC_SEND patch which
>> contains the actual send logic.
>>
>> Signed-off-by: Alexander Block<ablock84@googlemail.com>
>> ---
>>   fs/btrfs/ioctl.c |    3 +
>>   fs/btrfs/send.c  | 3246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/send.h  |    4 +
>>   3 files changed, 3253 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 8d258cb..9173867 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -54,6 +54,7 @@
>>   #include "inode-map.h"
>>   #include "backref.h"
>>   #include "rcu-string.h"
>> +#include "send.h"
>>
>>   /* Mask out flags that are inappropriate for the given type of inode. */
>>   static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
>> @@ -3567,6 +3568,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>>   		return btrfs_ioctl_balance_progress(root, argp);
>>   	case BTRFS_IOC_SET_RECEIVED_SUBVOL:
>>   		return btrfs_ioctl_set_received_subvol(file, argp);
>> +	case BTRFS_IOC_SEND:
>> +		return btrfs_ioctl_send(file, argp);
>>   	case BTRFS_IOC_GET_DEV_STATS:
>>   		return btrfs_ioctl_get_dev_stats(root, argp, 0);
>>   	case BTRFS_IOC_GET_AND_RESET_DEV_STATS:
>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>> index 47a2557..4d3fcfc 100644
>> --- a/fs/btrfs/send.c
>> +++ b/fs/btrfs/send.c
>> @@ -1007,3 +1007,3249 @@ out:
>>   	return ret;
>>   }
>>
>> +struct backref_ctx {
>> +	struct send_ctx *sctx;
>> +
>> +	/* number of total found references */
>> +	u64 found;
>> +
>> +	/*
>> +	 * used for clones found in send_root. clones found behind cur_objectid
>> +	 * and cur_offset are not considered as allowed clones.
>> +	 */
>> +	u64 cur_objectid;
>> +	u64 cur_offset;
>> +
>> +	/* may be truncated in case it's the last extent in a file */
>> +	u64 extent_len;
>> +
>> +	/* Just to check for bugs in backref resolving */
>> +	int found_in_send_root;
>> +};
>> +
>> +static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>> +{
>> +	u64 root = (u64)key;
>> +	struct clone_root *cr = (struct clone_root *)elt;
>> +
>> +	if (root<  cr->root->objectid)
>> +		return -1;
>> +	if (root>  cr->root->objectid)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static int __clone_root_cmp_sort(const void *e1, const void *e2)
>> +{
>> +	struct clone_root *cr1 = (struct clone_root *)e1;
>> +	struct clone_root *cr2 = (struct clone_root *)e2;
>> +
>> +	if (cr1->root->objectid<  cr2->root->objectid)
>> +		return -1;
>> +	if (cr1->root->objectid>  cr2->root->objectid)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Called for every backref that is found for the current extent.
>
> Comment: results are collected in sctx->clone_roots->ino/offset/found_refs
>
>> + */
>> +static int __iterate_backrefs(u64 ino, u64 offset, u64 root, void *ctx_)
>> +{
>> +	struct backref_ctx *bctx = ctx_;
>> +	struct clone_root *found;
>> +	int ret;
>> +	u64 i_size;
>> +
>> +	/* First check if the root is in the list of accepted clone sources */
>> +	found = bsearch((void *)root, bctx->sctx->clone_roots,
>> +			bctx->sctx->clone_roots_cnt,
>> +			sizeof(struct clone_root),
>> +			__clone_root_cmp_bsearch);
>> +	if (!found)
>> +		return 0;
>> +
>> +	if (found->root == bctx->sctx->send_root&&
>> +	    ino == bctx->cur_objectid&&
>> +	    offset == bctx->cur_offset) {
>> +		bctx->found_in_send_root = 1;
>
> found_in_send_root_and_cur_ino_offset?
I renamed it to found_itself. Hope that's more clear.
>
>> +	}
>> +
>> +	/*
>> +	 * There are inodes that have extents that lie behind it's i_size. Don't
>                                                                its
Fixed.
>> +	 * accept clones from these extents.
>> +	 */
>> +	ret = get_inode_info(found->root, ino,&i_size, NULL, NULL, NULL, NULL);
>> +	if (ret<  0)
>> +		return ret;
>> +
>> +	if (offset + bctx->extent_len>  i_size)
>> +		return 0;
>> +
>> +	/*
>> +	 * Make sure we don't consider clones from send_root that are
>> +	 * behind the current inode/offset.
>> +	 */
>> +	if (found->root == bctx->sctx->send_root) {
>> +		/*
>> +		 * TODO for the moment we don't accept clones from the inode
>> +		 * that is currently send. We may change this when
>> +		 * BTRFS_IOC_CLONE_RANGE supports cloning from and to the same
>> +		 * file.
>> +		 */
>> +		if (ino>= bctx->cur_objectid)
>> +			return 0;
>> +		/*if (ino>  ctx->cur_objectid)
>> +			return 0;
>> +		if (offset + ctx->extent_len>  ctx->cur_offset)
>> +			return 0;*/
>
> #if 0 ... #else ... #endif
Fixed.
>
>> +
>> +		bctx->found++;
>> +		found->found_refs++;
>> +		found->ino = ino;
>> +		found->offset = offset;
>
> only the last ino is kept?
>
I removed that return path. Now the code below the if handles that too.
>> +		return 0;
>> +	}
>> +
>> +	bctx->found++;
>> +	found->found_refs++;
>> +	if (ino<  found->ino) {
>> +		found->ino = ino;
>> +		found->offset = offset;
>
> whereas here only the lowest ino is kept. Why?
>
I take the lowest so that we use the same file as the clone source 
every time an extent was cloned multiple times.
>> +	} else if (found->ino == ino) {
>> +		/*
>> +		 * same extent found more than once in the same file.
>> +		 */
>> +		if (found->offset>  offset + bctx->extent_len)
>> +			found->offset = offset;
>
> This is unclear to me. Seems to mean something like
> 'find the lowest offset', but not exactly. Some explaination
> would be good.
Hmm...I don't remember why it was needed. I remember that I added it 
later for some reason, but I can't see why now.
>
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * path must point to the extent item when called.
>> + */
>
> What is the purpose of this function? I probably will figure it out
> when reading on, but a comment would be nice here.
>
Added a description of the function.
>> +static int find_extent_clone(struct send_ctx *sctx,
>> +			     struct btrfs_path *path,
>> +			     u64 ino, u64 data_offset,
>> +			     u64 ino_size,
>> +			     struct clone_root **found)
>> +{
>> +	int ret;
>> +	int extent_type;
>> +	u64 logical;
>> +	u64 num_bytes;
>> +	u64 extent_item_pos;
>> +	struct btrfs_file_extent_item *fi;
>> +	struct extent_buffer *eb = path->nodes[0];
>> +	struct backref_ctx backref_ctx;
>
> currently it's still small enough to keep in on stack, maybe a
> comment in struct backref_ctx that it is kept on stack would be
> nice.
>
To be safe, I've removed it from the stack and use kmalloc now.
>> +	struct clone_root *cur_clone_root;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_path *tmp_path;
>> +	u32 i;
>> +
>> +	tmp_path = alloc_path_for_send();
>> +	if (!tmp_path)
>> +		return -ENOMEM;
>> +
>> +	if (data_offset>= ino_size) {
>> +		/*
>> +		 * There may be extents that lie behind the file's size.
>> +		 * I at least had this in combination with snapshotting while
>> +		 * writing large files.
>> +		 */
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	fi = btrfs_item_ptr(eb, path->slots[0],
>> +			struct btrfs_file_extent_item);
>> +	extent_type = btrfs_file_extent_type(eb, fi);
>> +	if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	num_bytes = btrfs_file_extent_num_bytes(eb, fi);
>> +	logical = btrfs_file_extent_disk_bytenr(eb, fi);
>> +	if (logical == 0) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +	logical += btrfs_file_extent_offset(eb, fi);
>> +
>> +	ret = extent_from_logical(sctx->send_root->fs_info,
>> +			logical, tmp_path,&found_key);
>> +	btrfs_release_path(tmp_path);
>> +
>> +	if (ret<  0)
>> +		goto out;
>> +	if (ret&  BTRFS_EXTENT_FLAG_TREE_BLOCK) {
>> +		ret = -EIO;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * Setup the clone roots.
>> +	 */
>> +	for (i = 0; i<  sctx->clone_roots_cnt; i++) {
>> +		cur_clone_root = sctx->clone_roots + i;
>> +		cur_clone_root->ino = (u64)-1;
>> +		cur_clone_root->offset = 0;
>> +		cur_clone_root->found_refs = 0;
>> +	}
>> +
>> +	backref_ctx.sctx = sctx;
>> +	backref_ctx.found = 0;
>> +	backref_ctx.cur_objectid = ino;
>> +	backref_ctx.cur_offset = data_offset;
>> +	backref_ctx.found_in_send_root = 0;
>> +	backref_ctx.extent_len = num_bytes;
>> +
>> +	/*
>> +	 * The last extent of a file may be too large due to page alignment.
>> +	 * We need to adjust extent_len in this case so that the checks in
>> +	 * __iterate_backrefs work.
>> +	 */
>> +	if (data_offset + num_bytes>= ino_size)
>> +		backref_ctx.extent_len = ino_size - data_offset;
>> +
>> +	/*
>> +	 * Now collect all backrefs.
>> +	 */
>> +	extent_item_pos = logical - found_key.objectid;
>> +	ret = iterate_extent_inodes(sctx->send_root->fs_info,
>> +					found_key.objectid, extent_item_pos, 1,
>> +					__iterate_backrefs,&backref_ctx);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	if (!backref_ctx.found_in_send_root) {
>> +		/* found a bug in backref code? */
>> +		ret = -EIO;
>> +		printk(KERN_ERR "btrfs: ERROR did not find backref in "
>> +				"send_root. inode=%llu, offset=%llu, "
>> +				"logical=%llu\n",
>> +				ino, data_offset, logical);
>> +		goto out;
>> +	}
>> +
>> +verbose_printk(KERN_DEBUG "btrfs: find_extent_clone: data_offset=%llu, "
>> +		"ino=%llu, "
>> +		"num_bytes=%llu, logical=%llu\n",
>> +		data_offset, ino, num_bytes, logical);
>> +
>> +	if (!backref_ctx.found)
>> +		verbose_printk("btrfs:    no clones found\n");
>> +
>> +	cur_clone_root = NULL;
>> +	for (i = 0; i<  sctx->clone_roots_cnt; i++) {
>> +		if (sctx->clone_roots[i].found_refs) {
>> +			if (!cur_clone_root)
>> +				cur_clone_root = sctx->clone_roots + i;
>> +			else if (sctx->clone_roots[i].root == sctx->send_root)
>> +				/* prefer clones from send_root over others */
>> +				cur_clone_root = sctx->clone_roots + i;
>> +			break;
>
> If you break after the first found ref, you might miss the send_root.
>
Ay, that's true. Removed the break.
>> +		}
>> +
>> +	}
>> +
>> +	if (cur_clone_root) {
>> +		*found = cur_clone_root;
>> +		ret = 0;
>> +	} else {
>> +		ret = -ENOENT;
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(tmp_path);
>> +	return ret;
>> +}
>> +
>> +static int read_symlink(struct send_ctx *sctx,
>> +			struct btrfs_root *root,
>> +			u64 ino,
>> +			struct fs_path *dest)
>> +{
>> +	int ret;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_file_extent_item *ei;
>> +	u8 type;
>> +	u8 compression;
>> +	unsigned long off;
>> +	int len;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	key.objectid = ino;
>> +	key.type = BTRFS_EXTENT_DATA_KEY;
>> +	key.offset = 0;
>> +	ret = btrfs_search_slot(NULL, root,&key, path, 0, 0);
>> +	if (ret<  0)
>> +		goto out;
>> +	BUG_ON(ret);
>> +
>> +	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +			struct btrfs_file_extent_item);
>> +	type = btrfs_file_extent_type(path->nodes[0], ei);
>> +	compression = btrfs_file_extent_compression(path->nodes[0], ei);
>> +	BUG_ON(type != BTRFS_FILE_EXTENT_INLINE);
>> +	BUG_ON(compression);
>> +
>> +	off = btrfs_file_extent_inline_start(ei);
>> +	len = btrfs_file_extent_inline_len(path->nodes[0], ei);
>> +
>> +	ret = fs_path_add_from_extent_buffer(dest, path->nodes[0], off, len);
>> +	if (ret<  0)
>> +		goto out;
>
> superfluous
>
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Helper function to generate a file name that is unique in the root of
>> + * send_root and parent_root. This is used to generate names for orphan inodes.
>> + */
>> +static int gen_unique_name(struct send_ctx *sctx,
>> +			   u64 ino, u64 gen,
>> +			   struct fs_path *dest)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_path *path;
>> +	struct btrfs_dir_item *di;
>> +	char tmp[64];
>> +	int len;
>> +	u64 idx = 0;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	while (1) {
>> +		len = snprintf(tmp, sizeof(tmp) - 1, "o%llu-%llu-%llu",
>> +				ino, gen, idx);
>
> wouldn't it be easier to just take a uuid? This would save you a lot
> of code and especially the need to verify that the name is really
> unique, saving seeks.
The answer from Alex Lyakas is correct here. We generate unique names 
that must be the same for every call with the same ino/gen combination.
>
>> +		if (len>= sizeof(tmp)) {
>> +			/* should really not happen */
>> +			ret = -EOVERFLOW;
>> +			goto out;
>> +		}
>> +
>> +		di = btrfs_lookup_dir_item(NULL, sctx->send_root,
>> +				path, BTRFS_FIRST_FREE_OBJECTID,
>> +				tmp, strlen(tmp), 0);
>> +		btrfs_release_path(path);
>> +		if (IS_ERR(di)) {
>> +			ret = PTR_ERR(di);
>> +			goto out;
>> +		}
>> +		if (di) {
>> +			/* not unique, try again */
>> +			idx++;
>> +			continue;
>> +		}
>> +
>> +		if (!sctx->parent_root) {
>> +			/* unique */
>> +			ret = 0;
>> +			break;
>> +		}
>> +
>> +		di = btrfs_lookup_dir_item(NULL, sctx->parent_root,
>> +				path, BTRFS_FIRST_FREE_OBJECTID,
>> +				tmp, strlen(tmp), 0);
>> +		btrfs_release_path(path);
>> +		if (IS_ERR(di)) {
>> +			ret = PTR_ERR(di);
>> +			goto out;
>> +		}
>> +		if (di) {
>> +			/* not unique, try again */
>> +			idx++;
>> +			continue;
>> +		}
>> +		/* unique */
>> +		break;
>> +	}
>> +
>> +	ret = fs_path_add(dest, tmp, strlen(tmp));
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +enum inode_state {
>> +	inode_state_no_change,
>> +	inode_state_will_create,
>> +	inode_state_did_create,
>> +	inode_state_will_delete,
>> +	inode_state_did_delete,
>> +};
>> +
>> +static int get_cur_inode_state(struct send_ctx *sctx, u64 ino, u64 gen)
>
> don't you want to return a enum inode_state instead of int?
>
The function also returns error codes. Is an enum return type still 
preferred in such cases?
>> +{
>> +	int ret;
>> +	int left_ret;
>> +	int right_ret;
>> +	u64 left_gen;
>> +	u64 right_gen;
>> +
>> +	ret = get_inode_info(sctx->send_root, ino, NULL,&left_gen, NULL, NULL,
>> +			NULL);
>> +	if (ret<  0&&  ret != -ENOENT)
>> +		goto out;
>> +	left_ret = ret;
>> +
>> +	if (!sctx->parent_root) {
>> +		right_ret = -ENOENT;
>> +	} else {
>> +		ret = get_inode_info(sctx->parent_root, ino, NULL,&right_gen,
>> +				NULL, NULL, NULL);
>> +		if (ret<  0&&  ret != -ENOENT)
>> +			goto out;
>> +		right_ret = ret;
>> +	}
>> +
>> +	if (!left_ret&&  !right_ret) {
>> +		if (left_gen == gen&&  right_gen == gen)
>
> Please also use {} here
>
Fixed.
>> +			ret = inode_state_no_change;
>> +		else if (left_gen == gen) {
>> +			if (ino<  sctx->send_progress)
>> +				ret = inode_state_did_create;
>> +			else
>> +				ret = inode_state_will_create;
>> +		} else if (right_gen == gen) {
>> +			if (ino<  sctx->send_progress)
>> +				ret = inode_state_did_delete;
>> +			else
>> +				ret = inode_state_will_delete;
>> +		} else  {
>> +			ret = -ENOENT;
>> +		}
>> +	} else if (!left_ret) {
>> +		if (left_gen == gen) {
>> +			if (ino<  sctx->send_progress)
>> +				ret = inode_state_did_create;
>> +			else
>> +				ret = inode_state_will_create;
>> +		} else {
>> +			ret = -ENOENT;
>> +		}
>> +	} else if (!right_ret) {
>> +		if (right_gen == gen) {
>> +			if (ino<  sctx->send_progress)
>> +				ret = inode_state_did_delete;
>> +			else
>> +				ret = inode_state_will_delete;
>> +		} else {
>> +			ret = -ENOENT;
>> +		}
>> +	} else {
>> +		ret = -ENOENT;
>> +	}
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen)
>> +{
>> +	int ret;
>> +
>> +	ret = get_cur_inode_state(sctx, ino, gen);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (ret == inode_state_no_change ||
>> +	    ret == inode_state_did_create ||
>> +	    ret == inode_state_will_delete)
>> +		ret = 1;
>> +	else
>> +		ret = 0;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Helper function to lookup a dir item in a dir.
>> + */
>> +static int lookup_dir_item_inode(struct btrfs_root *root,
>> +				 u64 dir, const char *name, int name_len,
>> +				 u64 *found_inode,
>> +				 u8 *found_type)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_dir_item *di;
>> +	struct btrfs_key key;
>> +	struct btrfs_path *path;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	di = btrfs_lookup_dir_item(NULL, root, path,
>> +			dir, name, name_len, 0);
>> +	if (!di) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +	if (IS_ERR(di)) {
>> +		ret = PTR_ERR(di);
>> +		goto out;
>> +	}
>> +	btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key);
>> +	*found_inode = key.objectid;
>> +	*found_type = btrfs_dir_type(path->nodes[0], di);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +static int get_first_ref(struct send_ctx *sctx,
>
> The name does not reflect well what the function does.
> It's more like get_first_parent_dir or get_first_inode_ref
>
I did not rename it, as we have many more uses of xxx_ref function names 
which would all need to be renamed. I added a comment which should help 
to understand the purpose of this function.
>> +			 struct btrfs_root *root, u64 ino,
>> +			 u64 *dir, u64 *dir_gen, struct fs_path *name)
>> +{
>> +	int ret;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_path *path;
>> +	struct btrfs_inode_ref *iref;
>> +	int len;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	key.objectid = ino;
>> +	key.type = BTRFS_INODE_REF_KEY;
>> +	key.offset = 0;
>> +
>> +	ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +	if (!ret)
>> +		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
>> +				path->slots[0]);
>> +	if (ret || found_key.objectid != key.objectid ||
>> +	    found_key.type != key.type) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	iref = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +			struct btrfs_inode_ref);
>> +	len = btrfs_inode_ref_name_len(path->nodes[0], iref);
>> +	ret = fs_path_add_from_extent_buffer(name, path->nodes[0],
>> +			(unsigned long)(iref + 1), len);
>> +	if (ret < 0)
>> +		goto out;
>> +	btrfs_release_path(path);
>> +
>> +	ret = get_inode_info(root, found_key.offset, NULL, dir_gen, NULL, NULL,
>> +			NULL);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	*dir = found_key.offset;
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +static int is_first_ref(struct send_ctx *sctx,
>> +			struct btrfs_root *root,
>> +			u64 ino, u64 dir,
>> +			const char *name, int name_len)
>> +{
>> +	int ret;
>> +	struct fs_path *tmp_name;
>> +	u64 tmp_dir;
>> +	u64 tmp_dir_gen;
>> +
>> +	tmp_name = fs_path_alloc(sctx);
>> +	if (!tmp_name)
>> +		return -ENOMEM;
>> +
>> +	ret = get_first_ref(sctx, root, ino, &tmp_dir, &tmp_dir_gen, tmp_name);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (name_len != fs_path_len(tmp_name)) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	ret = memcmp(tmp_name->start, name, name_len);
>
> or just ret = !memcmp...?
>
Changed to !memcmp.
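A small user-space illustration of the simplification the review suggested (the wrapper name is hypothetical): memcmp() returns 0 on equality, so negating it yields the 1/0 result directly and the if/else below collapses into a single expression.

```c
#include <string.h>

/*
 * Returns 1 if the two names of the given length are identical,
 * 0 otherwise -- the same contract is_first_ref() wants.
 */
static int names_match(const char *a, const char *b, int len)
{
	return !memcmp(a, b, len);	/* memcmp == 0 means equal */
}
```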
>> +	if (ret)
>> +		ret = 0;
>> +	else
>> +		ret = 1;
>> +
>> +out:
>> +	fs_path_free(sctx, tmp_name);
>> +	return ret;
>> +}
>> +
>> +static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
>> +			      const char *name, int name_len,
>> +			      u64 *who_ino, u64 *who_gen)
>> +{
>> +	int ret = 0;
>> +	u64 other_inode = 0;
>> +	u8 other_type = 0;
>> +
>> +	if (!sctx->parent_root)
>> +		goto out;
>> +
>> +	ret = is_inode_existent(sctx, dir, dir_gen);
>> +	if (ret <= 0)
>> +		goto out;
>> +
>> +	ret = lookup_dir_item_inode(sctx->parent_root, dir, name, name_len,
>> +			&other_inode, &other_type);
>> +	if (ret < 0 && ret != -ENOENT)
>> +		goto out;
>> +	if (ret) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	if (other_inode > sctx->send_progress) {
>
> I haven't really grasped what this function does (a comment would be
> nice), but I have a feeling that renames might break things when the
> parent is not a direct ancestor. Maybe it gets clearer when I read
> on ;)
>
Hmm, in my tests it worked. We do have problems when the snapshots have 
no relation at all, which were reported by Alex Lyakas and which I'll go 
through later.
Added a comment.
>> +		ret = get_inode_info(sctx->parent_root, other_inode, NULL,
>> +				who_gen, NULL, NULL, NULL);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		ret = 1;
>> +		*who_ino = other_inode;
>> +	} else {
>> +		ret = 0;
>> +	}
>> +
>> +out:
>> +	return ret;
>> +}
>> +
Added a comment here.
>> +static int did_overwrite_ref(struct send_ctx *sctx,
>> +			    u64 dir, u64 dir_gen,
>> +			    u64 ino, u64 ino_gen,
>> +			    const char *name, int name_len)
>> +{
>> +	int ret = 0;
>> +	u64 gen;
>> +	u64 ow_inode;
>> +	u8 other_type;
>> +
>> +	if (!sctx->parent_root)
>> +		goto out;
>> +
>> +	ret = is_inode_existent(sctx, dir, dir_gen);
>> +	if (ret <= 0)
>> +		goto out;
>> +
>> +	/* check if the ref was overwritten by another ref */
>> +	ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len,
>> +			&ow_inode, &other_type);
>> +	if (ret < 0 && ret != -ENOENT)
>> +		goto out;
>> +	if (ret) {
>> +		/* was never and will never be overwritten */
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	ret = get_inode_info(sctx->send_root, ow_inode, NULL, &gen, NULL, NULL,
>> +			NULL);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (ow_inode == ino && gen == ino_gen) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	/* we know that it is or will be overwritten. check this now */
>> +	if (ow_inode < sctx->send_progress)
>> +		ret = 1;
>> +	else
>> +		ret = 0;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
Added a comment here.
>> +static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *name = NULL;
>> +	u64 dir;
>> +	u64 dir_gen;
>> +
>> +	if (!sctx->parent_root)
>> +		goto out;
>> +
>> +	name = fs_path_alloc(sctx);
>> +	if (!name)
>> +		return -ENOMEM;
>> +
>> +	ret = get_first_ref(sctx, sctx->parent_root, ino, &dir, &dir_gen, name);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = did_overwrite_ref(sctx, dir, dir_gen, ino, gen,
>> +			name->start, fs_path_len(name));
>
>> +	if (ret < 0)
>> +		goto out;
>
> superfluous
>
Removed.
>> +
>> +out:
>> +	fs_path_free(sctx, name);
>> +	return ret;
>> +}
>> +
>> +static int name_cache_insert(struct send_ctx *sctx,
>> +			     struct name_cache_entry *nce)
>> +{
>> +	int ret = 0;
>> +	struct name_cache_entry **ncea;
>> +
>> +	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);
>
> attention: radix_trees take an unsigned long as index, and ino
> is a u64. You're in trouble on 32 bit.
>
Fixed by using only the lower 32 bits as the index on 32-bit kernels, 
plus an additional list to resolve clashes.
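A sketch of the index truncation just described (the function name is hypothetical, not the patch's code): on a 32-bit kernel the radix tree index is a 32-bit unsigned long, so only the low 32 bits of the u64 inode number can serve as the key, and inode numbers that differ only in the high bits collide on one slot. Those clashes must be disambiguated by walking a per-slot list that stores the full 64-bit number.

```c
#include <stdint.h>

/*
 * Map a 64-bit inode number to a 32-bit radix tree index, modeling the
 * 32-bit kernel case. The truncation is deliberate: entries that share
 * the low 32 bits hang off the same slot and are told apart by their
 * full inode number on a clash list.
 */
static uint32_t ino_to_index32(uint64_t ino)
{
	return (uint32_t)ino;	/* keep only the low 32 bits */
}
```

Two inodes such as `0x1_00000001` and `0x1` map to the same index, which is exactly the clash the extra list has to resolve.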
>> +	if (ncea) {
>> +		if (!ncea[0])
>> +			ncea[0] = nce;
>> +		else if (!ncea[1])
>> +			ncea[1] = nce;
>> +		else
>> +			BUG();
>> +	} else {
>> +		ncea = kmalloc(sizeof(void *) * 2, GFP_NOFS);
>> +		if (!ncea)
>> +			return -ENOMEM;
>> +
>> +		ncea[0] = nce;
>> +		ncea[1] = NULL;
>> +		ret = radix_tree_insert(&sctx->name_cache, nce->ino, ncea);
>> +		if (ret < 0)
>> +			return ret;
>> +	}
>> +	list_add_tail(&nce->list, &sctx->name_cache_list);
>> +	sctx->name_cache_size++;
>> +
>> +	return ret;
>> +}
>> +
>> +static void name_cache_delete(struct send_ctx *sctx,
>> +			      struct name_cache_entry *nce)
>> +{
>> +	struct name_cache_entry **ncea;
>> +
>> +	ncea = radix_tree_lookup(&sctx->name_cache, nce->ino);
>> +	BUG_ON(!ncea);
>> +
>> +	if (ncea[0] == nce)
>> +		ncea[0] = NULL;
>> +	else if (ncea[1] == nce)
>> +		ncea[1] = NULL;
>> +	else
>> +		BUG();
>> +
>> +	if (!ncea[0] && !ncea[1]) {
>> +		radix_tree_delete(&sctx->name_cache, nce->ino);
>> +		kfree(ncea);
>> +	}
>> +
>> +	list_del(&nce->list);
>> +
>> +	sctx->name_cache_size--;
>> +}
>> +
>> +static struct name_cache_entry *name_cache_search(struct send_ctx *sctx,
>> +						    u64 ino, u64 gen)
>> +{
>> +	struct name_cache_entry **ncea;
>> +
>> +	ncea = radix_tree_lookup(&sctx->name_cache, ino);
>> +	if (!ncea)
>> +		return NULL;
>> +
>> +	if (ncea[0] && ncea[0]->gen == gen)
>> +		return ncea[0];
>> +	else if (ncea[1] && ncea[1]->gen == gen)
>> +		return ncea[1];
>> +	return NULL;
>> +}
>> +
>> +static void name_cache_used(struct send_ctx *sctx, struct name_cache_entry *nce)
>> +{
>> +	list_del(&nce->list);
>> +	list_add_tail(&nce->list, &sctx->name_cache_list);
>> +}
>> +
>> +static void name_cache_clean_unused(struct send_ctx *sctx)
>> +{
>> +	struct name_cache_entry *nce;
>> +
>> +	if (sctx->name_cache_size < SEND_CTX_NAME_CACHE_CLEAN_SIZE)
>> +		return;
>
> superfluous, the while condition below is enough.
>
Please note that the if and the while use different constants. I want to 
trigger cleanup only after some time and then clean up multiple entries 
at once.
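A user-space sketch of that two-threshold scheme (the constants are hypothetical stand-ins for SEND_CTX_NAME_CACHE_CLEAN_SIZE and SEND_CTX_MAX_NAME_CACHE_SIZE): nothing happens until the cache grows past the clean threshold, and then entries are evicted in a batch until the size drops to the lower max threshold, so cleanup runs rarely but reclaims several entries at once.

```c
/* Hypothetical thresholds; the clean trigger sits above the max size. */
#define CACHE_CLEAN_SIZE	128	/* start cleaning only past this */
#define CACHE_MAX_SIZE		64	/* ...then evict down to this */

/*
 * Given the current cache size, return the size after a cleanup pass.
 * The decrement stands in for deleting the least recently used entry.
 */
static int clean_unused(int size)
{
	if (size < CACHE_CLEAN_SIZE)
		return size;		/* below trigger: leave cache alone */
	while (size > CACHE_MAX_SIZE)
		size--;			/* batch eviction */
	return size;
}
```

The gap between the two constants is the hysteresis: a cache hovering around 100 entries is never touched, while one that reaches 128 is trimmed all the way back to 64 in one go.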
>> +
>> +	while (sctx->name_cache_size > SEND_CTX_MAX_NAME_CACHE_SIZE) {
>> +		nce = list_entry(sctx->name_cache_list.next,
>> +				struct name_cache_entry, list);
>> +		name_cache_delete(sctx, nce);
>> +		kfree(nce);
>> +	}
>> +}
>> +
>> +static void name_cache_free(struct send_ctx *sctx)
>> +{
>> +	struct name_cache_entry *nce;
>> +	struct name_cache_entry *tmp;
>> +
>> +	list_for_each_entry_safe(nce, tmp, &sctx->name_cache_list, list) {
>
> it's easier to just always delete the head until the list is empty.
> Saves you the tmp-var.
>
Changed it to while(!list_empty...
>> +		name_cache_delete(sctx, nce);
>> +	}
>> +}
>> +
>> +static int __get_cur_name_and_parent(struct send_ctx *sctx,
>> +				     u64 ino, u64 gen,
>> +				     u64 *parent_ino,
>> +				     u64 *parent_gen,
>> +				     struct fs_path *dest)
>> +{
>> +	int ret;
>> +	int nce_ret;
>> +	struct btrfs_path *path = NULL;
>> +	struct name_cache_entry *nce = NULL;
>> +
>> +	nce = name_cache_search(sctx, ino, gen);
>> +	if (nce) {
>> +		if (ino < sctx->send_progress && nce->need_later_update) {
>> +			name_cache_delete(sctx, nce);
>> +			kfree(nce);
>> +			nce = NULL;
>> +		} else {
>> +			name_cache_used(sctx, nce);
>> +			*parent_ino = nce->parent_ino;
>> +			*parent_gen = nce->parent_gen;
>> +			ret = fs_path_add(dest, nce->name, nce->name_len);
>> +			if (ret < 0)
>> +				goto out;
>> +			ret = nce->ret;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	ret = is_inode_existent(sctx, ino, gen);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (!ret) {
>> +		ret = gen_unique_name(sctx, ino, gen, dest);
>> +		if (ret < 0)
>> +			goto out;
>> +		ret = 1;
>> +		goto out_cache;
>> +	}
>> +
>> +	if (ino < sctx->send_progress)
>> +		ret = get_first_ref(sctx, sctx->send_root, ino,
>> +				parent_ino, parent_gen, dest);
>> +	else
>> +		ret = get_first_ref(sctx, sctx->parent_root, ino,
>> +				parent_ino, parent_gen, dest);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = did_overwrite_ref(sctx, *parent_ino, *parent_gen, ino, gen,
>> +			dest->start, dest->end - dest->start);
>> +	if (ret < 0)
>> +		goto out;
>> +	if (ret) {
>> +		fs_path_reset(dest);
>> +		ret = gen_unique_name(sctx, ino, gen, dest);
>> +		if (ret < 0)
>> +			goto out;
>> +		ret = 1;
>> +	}
>> +
>> +out_cache:
>> +	nce = kmalloc(sizeof(*nce) + fs_path_len(dest) + 1, GFP_NOFS);
>> +	if (!nce) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	nce->ino = ino;
>> +	nce->gen = gen;
>> +	nce->parent_ino = *parent_ino;
>> +	nce->parent_gen = *parent_gen;
>> +	nce->name_len = fs_path_len(dest);
>> +	nce->ret = ret;
>
> This is a bit too magic for me. ret == 1 iff it's a unique_name?
Added more comments. 1 means get_cur_path needs to stop and that we have 
an orphan inode.
>
>> +	strcpy(nce->name, dest->start);
>> +	memset(&nce->use_list, 0, sizeof(nce->use_list));
>
> use_list is unused, anyway, it's a strange way to initialize a
> list_head. There's the INIT_LIST_HEAD macro.
use_list is removed now.
>
>> +
>> +	if (ino < sctx->send_progress)
>> +		nce->need_later_update = 0;
>> +	else
>> +		nce->need_later_update = 1;
>> +
>> +	nce_ret = name_cache_insert(sctx, nce);
>> +	if (nce_ret < 0)
>> +		ret = nce_ret;
>> +	name_cache_clean_unused(sctx);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Magic happens here. This function returns the first ref to an inode as it
>> + * would look like while receiving the stream at this point in time.
>> + * We walk the path up to the root. For every inode in between, we check if it
>> + * was already processed/sent. If yes, we continue with the parent as found
>> + * in send_root. If not, we continue with the parent as found in parent_root.
>> + * If we encounter an inode that was deleted at this point in time, we use the
>> + * inodes "orphan" name instead of the real name and stop. Same with new inodes
>> + * that were not created yet and overwritten inodes/refs.
>> + *
>> + * When do we have orphan inodes:
>> + * 1. When an inode is freshly created and thus no valid refs are available yet
>> + * 2. When a directory lost all its refs (deleted) but still has dir items
>> + *    inside which were not processed yet (pending for move/delete). If anyone
>> + *    tried to get the path to the dir items, it would get a path inside that
>> + *    orphan directory.
>> + * 3. When an inode is moved around or gets new links, it may overwrite the ref
>> + *    of an unprocessed inode. If in that case the first ref would be
>> + *    overwritten, the overwritten inode gets "orphanized". Later when we
>> + *    process this overwritten inode, it is restored at a new place by moving
>> + *    the orphan inode.
>> + *
>> + * sctx->send_progress tells this function at which point in time receiving
>> + * would be.
>> + */
>
> Thanks for the comment :)
>
>> +static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen,
>> +			struct fs_path *dest)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *name = NULL;
>> +	u64 parent_inode = 0;
>> +	u64 parent_gen = 0;
>> +	int stop = 0;
>> +
>> +	name = fs_path_alloc(sctx);
>> +	if (!name) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	dest->reversed = 1;
>> +	fs_path_reset(dest);
>> +
>> +	while (!stop && ino != BTRFS_FIRST_FREE_OBJECTID) {
>> +		fs_path_reset(name);
>> +
>> +		ret = __get_cur_name_and_parent(sctx, ino, gen,
>> +				&parent_inode, &parent_gen, name);
>> +		if (ret < 0)
>> +			goto out;
>> +		if (ret)
>> +			stop = 1;
>> +
>> +		ret = fs_path_add_path(dest, name);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		ino = parent_inode;
>> +		gen = parent_gen;
>> +	}
>> +
>> +out:
>> +	fs_path_free(sctx, name);
>> +	if (!ret)
>> +		fs_path_unreverse(dest);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Called for regular files when sending extents data. Opens a struct file
>> + * to read from the file.
>> + */
>> +static int open_cur_inode_file(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_key key;
>> +	struct vfsmount *mnt;
>> +	struct inode *inode;
>> +	struct dentry *dentry;
>> +	struct file *filp;
>> +	int new = 0;
>> +
>> +	if (sctx->cur_inode_filp)
>> +		goto out;
>> +
>> +	key.objectid = sctx->cur_ino;
>> +	key.type = BTRFS_INODE_ITEM_KEY;
>> +	key.offset = 0;
>> +
>> +	inode = btrfs_iget(sctx->send_root->fs_info->sb, &key, sctx->send_root,
>> +			&new);
>> +	if (IS_ERR(inode)) {
>> +		ret = PTR_ERR(inode);
>> +		goto out;
>> +	}
>> +
>> +	dentry = d_obtain_alias(inode);
>> +	inode = NULL;
>> +	if (IS_ERR(dentry)) {
>> +		ret = PTR_ERR(dentry);
>> +		goto out;
>> +	}
>> +
>> +	mnt = mntget(sctx->mnt);
>> +	filp = dentry_open(dentry, mnt, O_RDONLY | O_LARGEFILE, current_cred());
>> +	dentry = NULL;
>> +	mnt = NULL;
>
> It would be good if this part could be reviewed by someone with
> deep vfs knowledge. Maybe you can compile those parts into a
> separate patch and send it to the appropriate ppl for review.
>
Linus had to merge parts of this function and did not complain. I will 
probably still send a new mail regarding this and the other vfs parts.
>> +	if (IS_ERR(filp)) {
>> +		ret = PTR_ERR(filp);
>> +		goto out;
>> +	}
>> +	sctx->cur_inode_filp = filp;
>> +
>> +out:
>> +	/*
>> +	 * no xxxput required here as every vfs op
>> +	 * does it by itself on failure
>> +	 */
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Closes the struct file that was created in open_cur_inode_file
>> + */
>> +static int close_cur_inode_file(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +
>> +	if (!sctx->cur_inode_filp)
>> +		goto out;
>> +
>> +	ret = filp_close(sctx->cur_inode_filp, NULL);
>> +	sctx->cur_inode_filp = NULL;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends a BTRFS_SEND_C_SUBVOL command/item to userspace
>> + */
>> +static int send_subvol_begin(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_root *send_root = sctx->send_root;
>> +	struct btrfs_root *parent_root = sctx->parent_root;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_root_ref *ref;
>> +	struct extent_buffer *leaf;
>> +	char *name = NULL;
>> +	int namelen;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	name = kmalloc(BTRFS_PATH_NAME_MAX, GFP_NOFS);
>> +	if (!name) {
>> +		btrfs_free_path(path);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	key.objectid = send_root->objectid;
>> +	key.type = BTRFS_ROOT_BACKREF_KEY;
>> +	key.offset = 0;
>> +
>> +	ret = btrfs_search_slot_for_read(send_root->fs_info->tree_root,
>> +				&key, path, 1, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +	if (ret) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +
>> +	leaf = path->nodes[0];
>> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +	if (key.type != BTRFS_ROOT_BACKREF_KEY ||
>> +	    key.objectid != send_root->objectid) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>
> It looks like we could use a helper for finding the first entry
> with a specific objectid+key...
>
Hmm yepp, I have a lot of places where things like this happen. Will do 
that later.
>> +	ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
>> +	namelen = btrfs_root_ref_name_len(leaf, ref);
>> +	read_extent_buffer(leaf, name, (unsigned long)(ref + 1), namelen);
>> +	btrfs_release_path(path);
>> +
>> +	if (ret < 0)
>> +		goto out;
>
> How can ret be < 0 here?
Whoops, a leftover. Removed.
>
>> +
>> +	if (parent_root) {
>> +		ret = begin_cmd(sctx, BTRFS_SEND_C_SNAPSHOT);
>> +		if (ret < 0)
>> +			goto out;
>> +	} else {
>> +		ret = begin_cmd(sctx, BTRFS_SEND_C_SUBVOL);
>> +		if (ret < 0)
>> +			goto out;
>> +	}
>> +
>> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen);
>
> It's called PATH, but it seems to be only the last path component.
> What about subvols that are ancored deeper in the dir tree?
>
Sounds like btrfs_root_ref_name_len does not contain the full path? If 
not, need to handle that.
>> +	TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
>> +			sctx->send_root->root_item.uuid);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID,
>> +			sctx->send_root->root_item.ctransid);
>> +	if (parent_root) {
>
> The name of the parent is not sent?
>
Nope. We can't use it for anything. And when we later allow receiving 
to an arbitrary path, we can't count on the parent path/name, only on 
the uuid of the parent.
>> +		TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
>> +				sctx->parent_root->root_item.uuid);
>> +		TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
>> +				sctx->parent_root->root_item.ctransid);
>> +	}
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	btrfs_free_path(path);
>> +	kfree(name);
>> +	return ret;
>> +}
>> +
>> +static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *p;
>> +
>> +verbose_printk("btrfs: send_truncate %llu size=%llu\n", ino, size);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_TRUNCATE);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, ino, gen, p);
>> +	if (ret < 0)
>> +		goto out;
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, size);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int send_chmod(struct send_ctx *sctx, u64 ino, u64 gen, u64 mode)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *p;
>> +
>> +verbose_printk("btrfs: send_chmod %llu mode=%llu\n", ino, mode);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CHMOD);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, ino, gen, p);
>> +	if (ret < 0)
>> +		goto out;
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode & 07777);
>
> four 7?
>
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int send_chown(struct send_ctx *sctx, u64 ino, u64 gen, u64 uid, u64 gid)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *p;
>> +
>> +verbose_printk("btrfs: send_chown %llu uid=%llu, gid=%llu\n", ino, uid, gid);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CHOWN);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, ino, gen, p);
>> +	if (ret < 0)
>> +		goto out;
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_UID, uid);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_GID, gid);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int send_utimes(struct send_ctx *sctx, u64 ino, u64 gen)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *p = NULL;
>> +	struct btrfs_inode_item *ii;
>> +	struct btrfs_path *path = NULL;
>> +	struct extent_buffer *eb;
>> +	struct btrfs_key key;
>> +	int slot;
>> +
>> +verbose_printk("btrfs: send_utimes %llu\n", ino);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	key.objectid = ino;
>> +	key.type = BTRFS_INODE_ITEM_KEY;
>> +	key.offset = 0;
>> +	ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0);
>> +	if (ret < 0)
>> +		goto out;
>
> you don't check for existence. I guess you know it exists, otherwise
> you wouldn't end up here...
>
Yepp, calling this function (and other send_xxx functions) is only 
allowed on send_root and with existing inodes.
>> +
>> +	eb = path->nodes[0];
>> +	slot = path->slots[0];
>> +	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_UTIMES);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, ino, gen, p);
>> +	if (ret < 0)
>> +		goto out;
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_ATIME, eb,
>> +			btrfs_inode_atime(ii));
>> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_MTIME, eb,
>> +			btrfs_inode_mtime(ii));
>> +	TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_CTIME, eb,
>> +			btrfs_inode_ctime(ii));
>> +	/* TODO otime? */
>
> yes, please :)
>
Can't do that for now. We need to wait for the otime patches to land 
upstream first. I changed the comment to make this clearer and also 
added a TODO to
https://btrfs.wiki.kernel.org/index.php/Btrfs_Send/Receive
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Sends a BTRFS_SEND_C_MKXXX or SYMLINK command to user space. We don't have
>> + * a valid path yet because we did not process the refs yet. So, the inode
>> + * is created as orphan.
>> + */
>> +static int send_create_inode(struct send_ctx *sctx, struct btrfs_path *path,
>> +			     struct btrfs_key *key)
>> +{
>> +	int ret = 0;
>> +	struct extent_buffer *eb = path->nodes[0];
>> +	struct btrfs_inode_item *ii;
>> +	struct fs_path *p;
>> +	int slot = path->slots[0];
>> +	int cmd;
>> +	u64 mode;
>> +
>> +verbose_printk("btrfs: send_create_inode %llu\n", sctx->cur_ino);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item);
>> +	mode = btrfs_inode_mode(eb, ii);
>> +
>> +	if (S_ISREG(mode))
>> +		cmd = BTRFS_SEND_C_MKFILE;
>> +	else if (S_ISDIR(mode))
>> +		cmd = BTRFS_SEND_C_MKDIR;
>> +	else if (S_ISLNK(mode))
>> +		cmd = BTRFS_SEND_C_SYMLINK;
>> +	else if (S_ISCHR(mode) || S_ISBLK(mode))
>> +		cmd = BTRFS_SEND_C_MKNOD;
>> +	else if (S_ISFIFO(mode))
>> +		cmd = BTRFS_SEND_C_MKFIFO;
>> +	else if (S_ISSOCK(mode))
>> +		cmd = BTRFS_SEND_C_MKSOCK;
>> +	else {
>
> normally you'd put {} in all cases if you need it for one.
>
Fixed that.
>> +		printk(KERN_WARNING "btrfs: unexpected inode type %o",
>> +				(int)(mode & S_IFMT));
>> +		ret = -ENOTSUPP;
>> +		goto out;
>> +	}
>> +
>> +	ret = begin_cmd(sctx, cmd);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = gen_unique_name(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +
>> +	if (S_ISLNK(mode)) {
>> +		fs_path_reset(p);
>> +		ret = read_symlink(sctx, sctx->send_root, sctx->cur_ino, p);
>> +		if (ret < 0)
>> +			goto out;
>> +		TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, p);
>> +	} else if (S_ISCHR(mode) || S_ISBLK(mode) ||
>> +		   S_ISFIFO(mode) || S_ISSOCK(mode)) {
>> +		TLV_PUT_U64(sctx, BTRFS_SEND_A_RDEV, btrfs_inode_rdev(eb, ii));
>> +	}
>> +
>> +	ret = send_cmd(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +struct recorded_ref {
>> +	struct list_head list;
>> +	char *dir_path;
>> +	char *name;
>> +	struct fs_path *full_path;
>> +	u64 dir;
>> +	u64 dir_gen;
>> +	int dir_path_len;
>> +	int name_len;
>> +};
>> +
>> +/*
>> + * We need to process new refs before deleted refs, but compare_tree gives us
>> + * everything mixed. So we first record all refs and later process them.
>> + * This function is a helper to record one ref.
>> + */
>> +static int record_ref(struct list_head *head, u64 dir,
>> +		      u64 dir_gen, struct fs_path *path)
>> +{
>> +	struct recorded_ref *ref;
>> +	char *tmp;
>> +
>> +	ref = kmalloc(sizeof(*ref), GFP_NOFS);
>> +	if (!ref)
>> +		return -ENOMEM;
>> +
>> +	ref->dir = dir;
>> +	ref->dir_gen = dir_gen;
>> +	ref->full_path = path;
>> +
>> +	tmp = strrchr(ref->full_path->start, '/');
>> +	if (!tmp) {
>> +		ref->name_len = ref->full_path->end - ref->full_path->start;
>> +		ref->name = ref->full_path->start;
>> +		ref->dir_path_len = 0;
>> +		ref->dir_path = ref->full_path->start;
>> +	} else {
>> +		tmp++;
>> +		ref->name_len = ref->full_path->end - tmp;
>> +		ref->name = tmp;
>> +		ref->dir_path = ref->full_path->start;
>> +		ref->dir_path_len = ref->full_path->end -
>> +				ref->full_path->start - 1 - ref->name_len;
>> +	}
>> +
>> +	list_add_tail(&ref->list, head);
>> +	return 0;
>> +}
>> +
>> +static void __free_recorded_refs(struct send_ctx *sctx, struct list_head *head)
>> +{
>> +	struct recorded_ref *cur;
>> +	struct recorded_ref *tmp;
>> +
>> +	list_for_each_entry_safe(cur, tmp, head, list) {
>> +		fs_path_free(sctx, cur->full_path);
>> +		kfree(cur);
>> +	}
>> +	INIT_LIST_HEAD(head);
>
> This is a bit non-obvious. You use the _safe-macro as if you're
> going to delete each entry, but then you don't delete it and
> instead just reset the head. I'd prefer a while(!list_empty())-
> list_del-loop here.
>
Changed to use while(!list_empty...
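A minimal user-space stand-in for the kernel's list handling (names are illustrative), sketching the drain idiom the review settled on: pop the head until the list is empty, instead of `list_for_each_entry_safe()` with a tmp variable.

```c
#include <stdlib.h>

struct node {
	struct node *next;
};

/* Drain the list, freeing every node; returns how many were freed. */
static int drain_list(struct node **head)
{
	int freed = 0;

	while (*head) {			/* analogous to !list_empty(head) */
		struct node *n = *head;

		*head = n->next;	/* analogous to list_del() */
		free(n);
		freed++;
	}
	return freed;
}

/*
 * Build a chain of n nodes, drain it, and report how many were freed,
 * or -1 if the head was not left NULL afterwards.
 */
static int drain_demo(int n)
{
	struct node *head = NULL;
	int i, freed;

	for (i = 0; i < n; i++) {
		struct node *node = calloc(1, sizeof(*node));

		node->next = head;
		head = node;
	}
	freed = drain_list(&head);
	return head == NULL ? freed : -1;
}
```

The upside over the `_safe` macro is that the loop states its intent directly: the list is being emptied, not merely traversed.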
>> +}
>> +
>> +static void free_recorded_refs(struct send_ctx *sctx)
>> +{
>> +	__free_recorded_refs(sctx, &sctx->new_refs);
>> +	__free_recorded_refs(sctx, &sctx->deleted_refs);
>> +}
>> +
>> +/*
>> + * Renames/moves a file/dir to it's orphan name. Used when the first
>                                    its
>
>> + * ref of an unprocessed inode gets overwritten and for all non empty
>> + * directories.
>> + */
>> +static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen,
>> +			  struct fs_path *path)
>> +{
>> +	int ret;
>> +	struct fs_path *orphan;
>> +
>> +	orphan = fs_path_alloc(sctx);
>> +	if (!orphan)
>> +		return -ENOMEM;
>> +
>> +	ret = gen_unique_name(sctx, ino, gen, orphan);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = send_rename(sctx, path, orphan);
>> +
>> +out:
>> +	fs_path_free(sctx, orphan);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Returns 1 if a directory can be removed at this point in time.
>> + * We check this by iterating all dir items and checking if the inode behind
>> + * the dir item was already processed.
>> + */
>> +static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 send_progress)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_root *root = sctx->parent_root;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_key loc;
>> +	struct btrfs_dir_item *di;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	key.objectid = dir;
>> +	key.type = BTRFS_DIR_INDEX_KEY;
>> +	key.offset = 0;
>> +
>> +	while (1) {
>> +		ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
>> +		if (ret < 0)
>> +			goto out;
>> +		if (!ret) {
>> +			btrfs_item_key_to_cpu(path->nodes[0], &found_key,
>> +					path->slots[0]);
>> +		}
>> +		if (ret || found_key.objectid != key.objectid ||
>> +		    found_key.type != key.type) {
>> +			break;
>> +		}
>
> another case for the above mentioned helper...
>
>> +
>> +		di = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +				struct btrfs_dir_item);
>> +		btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc);
>> +
>> +		if (loc.objectid > send_progress) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		btrfs_release_path(path);
>> +		key.offset = found_key.offset + 1;
>> +	}
>> +
>> +	ret = 1;
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * This does all the move/link/unlink/rmdir magic.
>> + */
>> +static int process_recorded_refs(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +	struct recorded_ref *cur;
>> +	struct ulist *check_dirs = NULL;
>> +	struct ulist_iterator uit;
>> +	struct ulist_node *un;
>> +	struct fs_path *valid_path = NULL;
>> +	u64 ow_inode;
>> +	u64 ow_gen;
>> +	int did_overwrite = 0;
>> +	int is_orphan = 0;
>> +
>> +verbose_printk("btrfs: process_recorded_refs %llu\n", sctx->cur_ino);
>> +
>> +	valid_path = fs_path_alloc(sctx);
>> +	if (!valid_path) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	check_dirs = ulist_alloc(GFP_NOFS);
>> +	if (!check_dirs) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * First, check if the first ref of the current inode was overwritten
>> +	 * before. If yes, we know that the current inode was already orphanized
>> +	 * and thus use the orphan name. If not, we can use get_cur_path to
>> +	 * get the path of the first ref as it would look while receiving at
>> +	 * this point in time.
>> +	 * New inodes are always orphan at the beginning, so force to use the
>> +	 * orphan name in this case.
>> +	 * The first ref is stored in valid_path and will be updated if it
>> +	 * gets moved around.
>> +	 */
>> +	if (!sctx->cur_inode_new) {
>> +		ret = did_overwrite_first_ref(sctx, sctx->cur_ino,
>> +				sctx->cur_inode_gen);
>> +		if (ret < 0)
>> +			goto out;
>> +		if (ret)
>> +			did_overwrite = 1;
>> +	}
>> +	if (sctx->cur_inode_new || did_overwrite) {
>> +		ret = gen_unique_name(sctx, sctx->cur_ino,
>> +				sctx->cur_inode_gen, valid_path);
>> +		if (ret<  0)
>> +			goto out;
>> +		is_orphan = 1;
>> +	} else {
>> +		ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen,
>> +				valid_path);
>> +		if (ret<  0)
>> +			goto out;
>> +	}
>> +
>> +	list_for_each_entry(cur, &sctx->new_refs, list) {
>> +		/*
>> +		 * Check if this new ref would overwrite the first ref of
>> +		 * another unprocessed inode. If yes, orphanize the
>> +		 * overwritten inode. If we find an overwritten ref that is
>> +		 * not the first ref, simply unlink it.
>> +		 */
>> +		ret = will_overwrite_ref(sctx, cur->dir, cur->dir_gen,
>> +				cur->name, cur->name_len,
>> +				&ow_inode,&ow_gen);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = is_first_ref(sctx, sctx->parent_root,
>> +					ow_inode, cur->dir, cur->name,
>> +					cur->name_len);
>> +			if (ret<  0)
>> +				goto out;
>> +			if (ret) {
>> +				ret = orphanize_inode(sctx, ow_inode, ow_gen,
>> +						cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			} else {
>> +				ret = send_unlink(sctx, cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			}
>> +		}
>> +
>> +		/*
>> +		 * link/move the ref to the new place. If we have an orphan
>> +		 * inode, move it and update valid_path. If not, link or move
>> +		 * it depending on the inode mode.
>> +		 */
>> +		if (is_orphan) {
>> +			ret = send_rename(sctx, valid_path, cur->full_path);
>> +			if (ret<  0)
>> +				goto out;
>> +			is_orphan = 0;
>> +			ret = fs_path_copy(valid_path, cur->full_path);
>> +			if (ret<  0)
>> +				goto out;
>> +		} else {
>> +			if (S_ISDIR(sctx->cur_inode_mode)) {
>
> why not save a level of indentation here by using <else if>?
>
The if does not exist anymore due to a recent patch.
>> +				/*
>> +				 * Dirs can't be linked, so move it. For moved
>> +				 * dirs, we always have one new and one deleted
>> +				 * ref. The deleted ref is ignored later.
>> +				 */
>> +				ret = send_rename(sctx, valid_path,
>> +						cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +				ret = fs_path_copy(valid_path, cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			} else {
>> +				ret = send_link(sctx, valid_path,
>> +						cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			}
>> +		}
>> +		ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
>
> careful, aux is only an unsigned long, meant to be as large as a pointer.
>
Will make aux 64bit.
>> +				GFP_NOFS);
>> +		if (ret<  0)
>> +			goto out;
>> +	}
>> +
>> +	if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_deleted) {
>> +		/*
>> +		 * Check if we can already rmdir the directory. If not,
>> +		 * orphanize it. For every dir item inside that gets deleted
>> +		 * later, we do this check again and rmdir it then if possible.
>> +		 * See the use of check_dirs for more details.
>> +		 */
>> +		ret = can_rmdir(sctx, sctx->cur_ino, sctx->cur_ino);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = send_rmdir(sctx, valid_path);
>> +			if (ret<  0)
>> +				goto out;
>> +		} else if (!is_orphan) {
>> +			ret = orphanize_inode(sctx, sctx->cur_ino,
>> +					sctx->cur_inode_gen, valid_path);
>> +			if (ret<  0)
>> +				goto out;
>> +			is_orphan = 1;
>> +		}
>> +
>> +		list_for_each_entry(cur,&sctx->deleted_refs, list) {
>> +			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
>> +					GFP_NOFS);
>> +			if (ret<  0)
>> +				goto out;
>> +		}
>> +	} else if (!S_ISDIR(sctx->cur_inode_mode)) {
>> +		/*
>> +		 * We have a non-dir inode. Go through all deleted refs and
>> +		 * unlink them if they were not already overwritten by other
>> +		 * inodes.
>> +		 */
>> +		list_for_each_entry(cur,&sctx->deleted_refs, list) {
>> +			ret = did_overwrite_ref(sctx, cur->dir, cur->dir_gen,
>> +					sctx->cur_ino, sctx->cur_inode_gen,
>> +					cur->name, cur->name_len);
>> +			if (ret<  0)
>> +				goto out;
>> +			if (!ret) {
>> +				ret = send_unlink(sctx, cur->full_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			}
>> +			ret = ulist_add(check_dirs, cur->dir, cur->dir_gen,
>> +					GFP_NOFS);
>> +			if (ret<  0)
>> +				goto out;
>> +		}
>> +
>> +		/*
>> +		 * If the inode is still orphan, unlink the orphan. This may
>> +		 * happen when a previous inode did overwrite the first ref
>> +		 * of this inode and no new refs were added for the current
>> +		 * inode.
>> +		 */
>> +		if (is_orphan) {
>> +			ret = send_unlink(sctx, valid_path);
>> +			if (ret<  0)
>> +				goto out;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * We have collected all parent dirs where cur_inode was once located.
>> +	 * Now go through all these dirs and check if they are pending for
>> +	 * deletion and whether it is now possible to perform the rmdir.
>> +	 * We also update the inode stats of the parent dirs here.
>> +	 */
>> +	ULIST_ITER_INIT(&uit);
>> +	while ((un = ulist_next(check_dirs,&uit))) {
>> +		if (un->val>  sctx->cur_ino)
>> +			continue;
>> +
>> +		ret = get_cur_inode_state(sctx, un->val, un->aux);
>> +		if (ret<  0)
>> +			goto out;
>> +
>> +		if (ret == inode_state_did_create ||
>> +		    ret == inode_state_no_change) {
>> +			/* TODO delayed utimes */
>> +			ret = send_utimes(sctx, un->val, un->aux);
>> +			if (ret<  0)
>> +				goto out;
>> +		} else if (ret == inode_state_did_delete) {
>> +			ret = can_rmdir(sctx, un->val, sctx->cur_ino);
>> +			if (ret<  0)
>> +				goto out;
>> +			if (ret) {
>> +				ret = get_cur_path(sctx, un->val, un->aux,
>> +						valid_path);
>> +				if (ret<  0)
>> +					goto out;
>> +				ret = send_rmdir(sctx, valid_path);
>> +				if (ret<  0)
>> +					goto out;
>> +			}
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Current inode is now at it's new position, so we must increase
>                                     its
>> +	 * send_progress
>> +	 */
>> +	sctx->send_progress = sctx->cur_ino + 1;
>
> is this the right place for it, or should be done at the calling
> site?
>
You're right, the caller should update send_progress. Especially as 
currently send_progress gets updated too early in the cur_inode_new_gen 
case (as reported by Alex Lyakas).
>> +
>> +	ret = 0;
>> +
>> +out:
>> +	free_recorded_refs(sctx);
>> +	ulist_free(check_dirs);
>> +	fs_path_free(sctx, valid_path);
>> +	return ret;
>> +}
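For context, the orphan naming that gen_unique_name relies on above can be
sketched in userspace. The "o%llu-%llu-%llu" format string and the helper
name are assumptions for illustration, not taken verbatim from the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of gen_unique_name: build candidate orphan names
 * from the inode number, its generation and a collision counter. The
 * format string is an assumption.
 */
static int sketch_unique_name(char *buf, size_t buflen,
			      unsigned long long ino,
			      unsigned long long gen,
			      unsigned long long idx)
{
	int ret = snprintf(buf, buflen, "o%llu-%llu-%llu", ino, gen, idx);

	return (ret < 0 || (size_t)ret >= buflen) ? -1 : 0;
}
```

A caller would retry with an incremented idx until the candidate name exists
in neither the send root nor the parent root.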
>> +
>> +static int __record_new_ref(int num, u64 dir, int index,
>> +			    struct fs_path *name,
>> +			    void *ctx)
>> +{
>> +	int ret = 0;
>> +	struct send_ctx *sctx = ctx;
>> +	struct fs_path *p;
>> +	u64 gen;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = get_inode_info(sctx->send_root, dir, NULL,&gen, NULL, NULL,
>> +			NULL);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, dir, gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +	ret = fs_path_add_path(p, name);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = record_ref(&sctx->new_refs, dir, gen, p);
>> +
>> +out:
>> +	if (ret)
>> +		fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int __record_deleted_ref(int num, u64 dir, int index,
>> +				struct fs_path *name,
>> +				void *ctx)
>> +{
>> +	int ret = 0;
>> +	struct send_ctx *sctx = ctx;
>> +	struct fs_path *p;
>> +	u64 gen;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = get_inode_info(sctx->parent_root, dir, NULL,&gen, NULL, NULL,
>> +			NULL);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, dir, gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +	ret = fs_path_add_path(p, name);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = record_ref(&sctx->deleted_refs, dir, gen, p);
>> +
>> +out:
>> +	if (ret)
>> +		fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int record_new_ref(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +
>> +	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
>> +			sctx->cmp_key, 0, __record_new_ref, sctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static int record_deleted_ref(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +
>> +	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, 0, __record_deleted_ref, sctx);
>> +	return ret;
>> +}
>> +
>> +struct find_ref_ctx {
>> +	u64 dir;
>> +	struct fs_path *name;
>> +	int found_idx;
>> +};
>> +
>> +static int __find_iref(int num, u64 dir, int index,
>> +		       struct fs_path *name,
>> +		       void *ctx_)
>> +{
>> +	struct find_ref_ctx *ctx = ctx_;
>> +
>> +	if (dir == ctx->dir && fs_path_len(name) == fs_path_len(ctx->name) &&
>> +	    strncmp(name->start, ctx->name->start, fs_path_len(name)) == 0) {
>> +		ctx->found_idx = num;
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int find_iref(struct send_ctx *sctx,
>> +		     struct btrfs_root *root,
>> +		     struct btrfs_path *path,
>> +		     struct btrfs_key *key,
>> +		     u64 dir, struct fs_path *name)
>> +{
>> +	int ret;
>> +	struct find_ref_ctx ctx;
>> +
>> +	ctx.dir = dir;
>> +	ctx.name = name;
>> +	ctx.found_idx = -1;
>> +
>> +	ret = iterate_inode_ref(sctx, root, path, key, 0, __find_iref,&ctx);
>> +	if (ret<  0)
>> +		return ret;
>> +
>> +	if (ctx.found_idx == -1)
>> +		return -ENOENT;
>> +
>> +	return ctx.found_idx;
>> +}
>> +
>> +static int __record_changed_new_ref(int num, u64 dir, int index,
>> +				    struct fs_path *name,
>> +				    void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +
>> +	ret = find_iref(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, dir, name);
>> +	if (ret == -ENOENT)
>> +		ret = __record_new_ref(num, dir, index, name, sctx);
>> +	else if (ret>  0)
>> +		ret = 0;
>> +
>> +	return ret;
>> +}
>> +
>> +static int __record_changed_deleted_ref(int num, u64 dir, int index,
>> +					struct fs_path *name,
>> +					void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +
>> +	ret = find_iref(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
>> +			dir, name);
>> +	if (ret == -ENOENT)
>> +		ret = __record_deleted_ref(num, dir, index, name, sctx);
>> +	else if (ret>  0)
>> +		ret = 0;
>> +
>> +	return ret;
>> +}
>> +
>> +static int record_changed_ref(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +
>> +	ret = iterate_inode_ref(sctx, sctx->send_root, sctx->left_path,
>> +			sctx->cmp_key, 0, __record_changed_new_ref, sctx);
>> +	if (ret<  0)
>> +		goto out;
>> +	ret = iterate_inode_ref(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, 0, __record_changed_deleted_ref, sctx);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Record and process all refs at once. Needed when an inode changes the
>> + * generation number, which means that it was deleted and recreated.
>> + */
>> +static int process_all_refs(struct send_ctx *sctx,
>> +			    enum btrfs_compare_tree_result cmd)
>> +{
>> +	int ret;
>> +	struct btrfs_root *root;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct extent_buffer *eb;
>> +	int slot;
>> +	iterate_inode_ref_t cb;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	if (cmd == BTRFS_COMPARE_TREE_NEW) {
>> +		root = sctx->send_root;
>> +		cb = __record_new_ref;
>> +	} else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
>> +		root = sctx->parent_root;
>> +		cb = __record_deleted_ref;
>> +	} else {
>> +		BUG();
>> +	}
>> +
>> +	key.objectid = sctx->cmp_key->objectid;
>> +	key.type = BTRFS_INODE_REF_KEY;
>> +	key.offset = 0;
>> +	while (1) {
>> +		ret = btrfs_search_slot_for_read(root,&key, path, 1, 0);
>> +		if (ret<  0) {
>> +			btrfs_release_path(path);
>
> not needed
Removed.
>
>> +			goto out;
>> +		}
>> +		if (ret) {
>> +			btrfs_release_path(path);
>
> ditto
It was needed here because process_recorded_refs is called after the
break. But I moved the release out of the loop so that it isn't
duplicated for the next break.
>
>> +			break;
>> +		}
>> +
>> +		eb = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(eb,&found_key, slot);
>> +
>> +		if (found_key.objectid != key.objectid ||
>> +		    found_key.type != key.type) {
>> +			btrfs_release_path(path);
>
> and here
See above.
>
>> +			break;
>> +		}
>
> helper :)
>
>> +
>> +		ret = iterate_inode_ref(sctx, sctx->parent_root, path,
>> +				&found_key, 0, cb, sctx);
>> +		btrfs_release_path(path);
>> +		if (ret<  0)
>> +			goto out;
>> +
>> +		key.offset = found_key.offset + 1;
>> +	}
>> +
>> +	ret = process_recorded_refs(sctx);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +static int send_set_xattr(struct send_ctx *sctx,
>> +			  struct fs_path *path,
>> +			  const char *name, int name_len,
>> +			  const char *data, int data_len)
>> +{
>> +	int ret = 0;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_SET_XATTR);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
>> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
>> +	TLV_PUT(sctx, BTRFS_SEND_A_XATTR_DATA, data, data_len);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
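The TLV_PUT helpers above frame each command attribute as a type/length/value
triple. A minimal userspace sketch of such an append follows; the 16-bit
little-endian header layout is an assumption for illustration, not the
patch's actual wire format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one attribute as: u16 type, u16 length, then the raw value. */
static int tlv_put(uint8_t *buf, size_t buflen, size_t *off,
		   uint16_t type, const void *data, uint16_t len)
{
	if (*off + 4 + (size_t)len > buflen)
		return -1;		/* would overflow the command buffer */
	buf[(*off)++] = type & 0xff;	/* little-endian type */
	buf[(*off)++] = type >> 8;
	buf[(*off)++] = len & 0xff;	/* little-endian length */
	buf[(*off)++] = len >> 8;
	memcpy(buf + *off, data, len);
	*off += len;
	return 0;
}
```

Macros like TLV_PUT_STRING and TLV_PUT_U64 would then be thin wrappers that
pass the right length and jump to tlv_put_failure on error.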
>> +
>> +static int send_remove_xattr(struct send_ctx *sctx,
>> +			  struct fs_path *path,
>> +			  const char *name, int name_len)
>> +{
>> +	int ret = 0;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_REMOVE_XATTR);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path);
>> +	TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int __process_new_xattr(int num, const char *name, int name_len,
>> +			       const char *data, int data_len,
>> +			       u8 type, void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +	struct fs_path *p;
>> +	posix_acl_xattr_header dummy_acl;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * This hack is needed because empty ACLs are stored as zero byte
>> +	 * data in xattrs. The problem is that receiving these zero byte
>> +	 * ACLs will fail later. To fix this, we send a dummy ACL list that
>> +	 * only contains the version number and no entries.
>> +	 */
>> +	if (!strncmp(name, XATTR_NAME_POSIX_ACL_ACCESS, name_len) ||
>> +	    !strncmp(name, XATTR_NAME_POSIX_ACL_DEFAULT, name_len)) {
>> +		if (data_len == 0) {
>> +			dummy_acl.a_version =
>> +					cpu_to_le32(POSIX_ACL_XATTR_VERSION);
>> +			data = (char *)&dummy_acl;
>> +			data_len = sizeof(dummy_acl);
>> +		}
>> +	}
>> +
>> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = send_set_xattr(sctx, p, name, name_len, data, data_len);
>> +
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
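The empty-ACL workaround can be reproduced in isolation: a zero-length ACL
xattr value is swapped for a header-only ACL carrying just the version field.
The struct below mirrors posix_acl_xattr_header (a single 32-bit version,
which the kernel stores little-endian); the helper name is made up for the
sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POSIX_ACL_XATTR_VERSION 0x0002

/* Header-only ACL: version field, no entries. */
struct acl_header {
	uint32_t a_version;	/* stored little-endian on disk */
};

/*
 * If the xattr payload is empty, substitute a dummy header-only ACL and
 * return the new payload length (assumes a little-endian host for the
 * sketch; the kernel would use cpu_to_le32).
 */
static size_t fixup_empty_acl(const void **data, size_t data_len,
			      struct acl_header *dummy)
{
	if (data_len == 0) {
		dummy->a_version = POSIX_ACL_XATTR_VERSION;
		*data = dummy;
		return sizeof(*dummy);
	}
	return data_len;
}
```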
>> +
>> +static int __process_deleted_xattr(int num, const char *name, int name_len,
>> +				   const char *data, int data_len,
>> +				   u8 type, void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +	struct fs_path *p;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = send_remove_xattr(sctx, p, name, name_len);
>> +
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int process_new_xattr(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +
>> +	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
>> +			sctx->cmp_key, __process_new_xattr, sctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static int process_deleted_xattr(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +
>> +	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, __process_deleted_xattr, sctx);
>> +
>> +	return ret;
>> +}
>> +
>> +struct find_xattr_ctx {
>> +	const char *name;
>> +	int name_len;
>> +	int found_idx;
>> +	char *found_data;
>> +	int found_data_len;
>> +};
>> +
>> +static int __find_xattr(int num, const char *name, int name_len,
>> +			const char *data, int data_len,
>> +			u8 type, void *vctx)
>> +{
>> +	struct find_xattr_ctx *ctx = vctx;
>> +
>> +	if (name_len == ctx->name_len &&
>> +	    strncmp(name, ctx->name, name_len) == 0) {
>> +		ctx->found_idx = num;
>> +		ctx->found_data_len = data_len;
>> +		ctx->found_data = kmalloc(data_len, GFP_NOFS);
>> +		if (!ctx->found_data)
>> +			return -ENOMEM;
>> +		memcpy(ctx->found_data, data, data_len);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int find_xattr(struct send_ctx *sctx,
>> +		      struct btrfs_root *root,
>> +		      struct btrfs_path *path,
>> +		      struct btrfs_key *key,
>> +		      const char *name, int name_len,
>> +		      char **data, int *data_len)
>> +{
>> +	int ret;
>> +	struct find_xattr_ctx ctx;
>> +
>> +	ctx.name = name;
>> +	ctx.name_len = name_len;
>> +	ctx.found_idx = -1;
>> +	ctx.found_data = NULL;
>> +	ctx.found_data_len = 0;
>> +
>> +	ret = iterate_dir_item(sctx, root, path, key, __find_xattr,&ctx);
>> +	if (ret<  0)
>> +		return ret;
>> +
>> +	if (ctx.found_idx == -1)
>> +		return -ENOENT;
>> +	if (data) {
>> +		*data = ctx.found_data;
>> +		*data_len = ctx.found_data_len;
>> +	} else {
>> +		kfree(ctx.found_data);
>> +	}
>> +	return ctx.found_idx;
>> +}
>> +
>> +
>> +static int __process_changed_new_xattr(int num, const char *name, int name_len,
>> +				       const char *data, int data_len,
>> +				       u8 type, void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +	char *found_data = NULL;
>> +	int found_data_len  = 0;
>> +	struct fs_path *p = NULL;
>> +
>> +	ret = find_xattr(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, name, name_len,&found_data,
>> +			&found_data_len);
>> +	if (ret == -ENOENT) {
>> +		ret = __process_new_xattr(num, name, name_len, data, data_len,
>> +				type, ctx);
>> +	} else if (ret>= 0) {
>> +		if (data_len != found_data_len ||
>> +		    memcmp(data, found_data, data_len)) {
>> +			ret = __process_new_xattr(num, name, name_len, data,
>> +					data_len, type, ctx);
>> +		} else {
>> +			ret = 0;
>> +		}
>> +	}
>> +
>> +	kfree(found_data);
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int __process_changed_deleted_xattr(int num, const char *name,
>> +					   int name_len,
>> +					   const char *data, int data_len,
>> +					   u8 type, void *ctx)
>> +{
>> +	int ret;
>> +	struct send_ctx *sctx = ctx;
>> +
>> +	ret = find_xattr(sctx, sctx->send_root, sctx->left_path, sctx->cmp_key,
>> +			name, name_len, NULL, NULL);
>> +	if (ret == -ENOENT)
>> +		ret = __process_deleted_xattr(num, name, name_len, data,
>> +				data_len, type, ctx);
>> +	else if (ret>= 0)
>> +		ret = 0;
>> +
>> +	return ret;
>> +}
>> +
>> +static int process_changed_xattr(struct send_ctx *sctx)
>> +{
>> +	int ret = 0;
>> +
>> +	ret = iterate_dir_item(sctx, sctx->send_root, sctx->left_path,
>> +			sctx->cmp_key, __process_changed_new_xattr, sctx);
>> +	if (ret<  0)
>> +		goto out;
>> +	ret = iterate_dir_item(sctx, sctx->parent_root, sctx->right_path,
>> +			sctx->cmp_key, __process_changed_deleted_xattr, sctx);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int process_all_new_xattrs(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_root *root;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct extent_buffer *eb;
>> +	int slot;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	root = sctx->send_root;
>> +
>> +	key.objectid = sctx->cmp_key->objectid;
>> +	key.type = BTRFS_XATTR_ITEM_KEY;
>> +	key.offset = 0;
>> +	while (1) {
>> +		ret = btrfs_search_slot_for_read(root,&key, path, 1, 0);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		eb = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(eb,&found_key, slot);
>> +
>> +		if (found_key.objectid != key.objectid ||
>> +		    found_key.type != key.type) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>
> helper...
>
>> +
>> +		ret = iterate_dir_item(sctx, root, path,&found_key,
>> +				__process_new_xattr, sctx);
>> +		if (ret<  0)
>> +			goto out;
>> +
>> +		btrfs_release_path(path);
>> +		key.offset = found_key.offset + 1;
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Read some bytes from the current inode/file and send a write command to
>> + * user space.
>> + */
>> +static int send_write(struct send_ctx *sctx, u64 offset, u32 len)
>> +{
>> +	int ret = 0;
>> +	struct fs_path *p;
>> +	loff_t pos = offset;
>> +	int readed;
>> +	mm_segment_t old_fs;
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * vfs normally only accepts user space buffers for security reasons.
>> +	 * We only read from the file and only provide the read_buf buffer
>> +	 * to vfs. As this buffer does not come from a user space call, it's
>> +	 * ok to temporarily allow kernel space buffers.
>> +	 */
>> +	old_fs = get_fs();
>> +	set_fs(KERNEL_DS);
>> +
>> +verbose_printk("btrfs: send_write offset=%llu, len=%d\n", offset, len);
>> +
>> +	ret = open_cur_inode_file(sctx);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = vfs_read(sctx->cur_inode_filp, sctx->read_buf, len,&pos);
>> +	if (ret<  0)
>> +		goto out;
>> +	readed = ret;
>
> num_read?
>
renamed.
>> +	if (!readed)
>> +		goto out;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
>> +	TLV_PUT(sctx, BTRFS_SEND_A_DATA, sctx->read_buf, readed);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	set_fs(old_fs);
>> +	if (ret<  0)
>> +		return ret;
>> +	return readed;
>> +}
>> +
>> +/*
>> + * Send a clone command to user space.
>> + */
>> +static int send_clone(struct send_ctx *sctx,
>> +		      u64 offset, u32 len,
>> +		      struct clone_root *clone_root)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_root *clone_root2 = clone_root->root;
>
> a name from hell :)
>
Removed it completely; I'm now using clone_root->root directly below.
>> +	struct fs_path *p;
>> +	u64 gen;
>> +
>> +verbose_printk("btrfs: send_clone offset=%llu, len=%d, clone_root=%llu, "
>> +	       "clone_inode=%llu, clone_offset=%llu\n", offset, len,
>> +		clone_root->root->objectid, clone_root->ino,
>> +		clone_root->offset);
>> +
>> +	p = fs_path_alloc(sctx);
>> +	if (!p)
>> +		return -ENOMEM;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_CLONE);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_LEN, len);
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>> +
>> +	if (clone_root2 == sctx->send_root) {
>> +		ret = get_inode_info(sctx->send_root, clone_root->ino, NULL,
>> +				&gen, NULL, NULL, NULL);
>> +		if (ret<  0)
>> +			goto out;
>> +		ret = get_cur_path(sctx, clone_root->ino, gen, p);
>> +	} else {
>> +		ret = get_inode_path(sctx, clone_root2, clone_root->ino, p);
>> +	}
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID,
>> +			clone_root2->root_item.uuid);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID,
>> +			clone_root2->root_item.ctransid);
>> +	TLV_PUT_PATH(sctx, BTRFS_SEND_A_CLONE_PATH, p);
>> +	TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_OFFSET,
>> +			clone_root->offset);
>> +
>> +	ret = send_cmd(sctx);
>> +
>> +tlv_put_failure:
>> +out:
>> +	fs_path_free(sctx, p);
>> +	return ret;
>> +}
>> +
>> +static int send_write_or_clone(struct send_ctx *sctx,
>> +			       struct btrfs_path *path,
>> +			       struct btrfs_key *key,
>> +			       struct clone_root *clone_root)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_file_extent_item *ei;
>> +	u64 offset = key->offset;
>> +	u64 pos = 0;
>> +	u64 len;
>> +	u32 l;
>> +	u8 type;
>> +
>> +	ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +			struct btrfs_file_extent_item);
>> +	type = btrfs_file_extent_type(path->nodes[0], ei);
>> +	if (type == BTRFS_FILE_EXTENT_INLINE)
>> +		len = btrfs_file_extent_inline_len(path->nodes[0], ei);
>> +	else
>> +		len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
>
> BTRFS_FILE_EXTENT_PREALLOC?
>
Isn't num_bytes also valid for PREALLOC?
>> +
>> +	if (offset + len>  sctx->cur_inode_size)
>> +		len = sctx->cur_inode_size - offset;
>> +	if (len == 0) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	if (!clone_root) {
>> +		while (pos<  len) {
>> +			l = len - pos;
>> +			if (l>  BTRFS_SEND_READ_SIZE)
>> +				l = BTRFS_SEND_READ_SIZE;
>> +			ret = send_write(sctx, pos + offset, l);
>> +			if (ret<  0)
>> +				goto out;
>> +			if (!ret)
>> +				break;
>> +			pos += ret;
>> +		}
>> +		ret = 0;
>> +	} else {
>> +		ret = send_clone(sctx, offset, len, clone_root);
>> +	}
>> +
>> +out:
>> +	return ret;
>> +}
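The non-clone path above splits the extent into BTRFS_SEND_READ_SIZE-sized
write commands. The chunking loop can be sketched on its own (the chunk size
below is an arbitrary stand-in, and the helper is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_READ_SIZE 48 /* stand-in for BTRFS_SEND_READ_SIZE */

/*
 * Walk [offset, offset + len) in chunks of at most SKETCH_READ_SIZE, as
 * send_write_or_clone does, and return the number of write commands that
 * would be emitted.
 */
static int count_write_cmds(uint64_t offset, uint64_t len)
{
	uint64_t pos = 0;
	int cmds = 0;

	(void)offset;	/* a real loop would pass offset + pos to send_write */
	while (pos < len) {
		uint32_t l = len - pos > SKETCH_READ_SIZE ?
			     SKETCH_READ_SIZE : (uint32_t)(len - pos);
		pos += l;
		cmds++;
	}
	return cmds;
}
```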
>> +
>> +static int is_extent_unchanged(struct send_ctx *sctx,
>> +			       struct btrfs_path *left_path,
>> +			       struct btrfs_key *ekey)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_key key;
>> +	struct btrfs_path *path = NULL;
>> +	struct extent_buffer *eb;
>> +	int slot;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_file_extent_item *ei;
>> +	u64 left_disknr;
>> +	u64 right_disknr;
>> +	u64 left_offset;
>> +	u64 right_offset;
>> +	u64 left_len;
>> +	u64 right_len;
>> +	u8 left_type;
>> +	u8 right_type;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	eb = left_path->nodes[0];
>> +	slot = left_path->slots[0];
>> +
>> +	ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>> +	left_type = btrfs_file_extent_type(eb, ei);
>> +	left_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
>> +	left_len = btrfs_file_extent_num_bytes(eb, ei);
>> +	left_offset = btrfs_file_extent_offset(eb, ei);
>> +
>> +	if (left_type != BTRFS_FILE_EXTENT_REG) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	key.objectid = ekey->objectid;
>> +	key.type = BTRFS_EXTENT_DATA_KEY;
>> +	key.offset = ekey->offset;
>> +
>> +	while (1) {
>> +		ret = btrfs_search_slot_for_read(sctx->parent_root,&key, path,
>> +				0, 0);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +		btrfs_item_key_to_cpu(path->nodes[0],&found_key,
>> +				path->slots[0]);
>> +		if (found_key.objectid != key.objectid ||
>> +		    found_key.type != key.type) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>
> helper...
>
>> +		eb = path->nodes[0];
>> +		slot = path->slots[0];
>> +
>> +		ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>> +		right_type = btrfs_file_extent_type(eb, ei);
>> +		right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
>> +		right_len = btrfs_file_extent_num_bytes(eb, ei);
>> +		right_offset = btrfs_file_extent_offset(eb, ei);
>> +		btrfs_release_path(path);
>> +
>> +		if (right_type != BTRFS_FILE_EXTENT_REG) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		if (left_disknr != right_disknr) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		key.offset = found_key.offset + right_len;
>> +		if (key.offset>= ekey->offset + left_len) {
>> +			ret = 1;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
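is_extent_unchanged above walks the parent root's extents until they cover
the whole left extent, bailing out as soon as a disk bytenr differs. Under
the simplifying assumption that the parent's extents are already collected,
sorted and contiguous, the walk reduces to the following sketch (names and
the array representation are illustrative only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint64_t offset;	/* file offset (key.offset) */
	uint64_t len;		/* num_bytes */
	uint64_t disknr;	/* disk_bytenr */
};

/*
 * Return 1 if the parent extents cover [ekey_offset, ekey_offset + left_len)
 * with the same disknr as the left extent, 0 otherwise. Simplified: assumes
 * the parent extents are sorted and contiguous.
 */
static int extent_unchanged(const struct sketch_extent *parent, size_t n,
			    uint64_t ekey_offset, uint64_t left_len,
			    uint64_t left_disknr)
{
	uint64_t key_offset = ekey_offset;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parent[i].offset != key_offset)
			continue;	/* not the extent we look for next */
		if (parent[i].disknr != left_disknr)
			return 0;	/* data changed */
		key_offset = parent[i].offset + parent[i].len;
		if (key_offset >= ekey_offset + left_len)
			return 1;	/* left extent fully covered */
	}
	return 0;
}
```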
>> +
>> +static int process_extent(struct send_ctx *sctx,
>> +			  struct btrfs_path *path,
>> +			  struct btrfs_key *key)
>> +{
>> +	int ret = 0;
>> +	struct clone_root *found_clone = NULL;
>> +
>> +	if (S_ISLNK(sctx->cur_inode_mode))
>> +		return 0;
>> +
>> +	if (sctx->parent_root && !sctx->cur_inode_new) {
>> +		ret = is_extent_unchanged(sctx, path, key);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>> +			sctx->cur_inode_size,&found_clone);
>> +	if (ret != -ENOENT && ret < 0)
>> +		goto out;
>> +
>> +	ret = send_write_or_clone(sctx, path, key, found_clone);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int process_all_extents(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_root *root;
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct extent_buffer *eb;
>> +	int slot;
>> +
>> +	root = sctx->send_root;
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	key.objectid = sctx->cmp_key->objectid;
>> +	key.type = BTRFS_EXTENT_DATA_KEY;
>> +	key.offset = 0;
>> +	while (1) {
>> +		ret = btrfs_search_slot_for_read(root,&key, path, 1, 0);
>> +		if (ret<  0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		eb = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(eb,&found_key, slot);
>> +
>> +		if (found_key.objectid != key.objectid ||
>> +		    found_key.type != key.type) {
>> +			ret = 0;
>> +			goto out;
>> +		}
>> +
>> +		ret = process_extent(sctx, path,&found_key);
>> +		if (ret<  0)
>> +			goto out;
>> +
>> +		btrfs_release_path(path);
>> +		key.offset = found_key.offset + 1;
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +static int process_recorded_refs_if_needed(struct send_ctx *sctx, int at_end)
>> +{
>> +	int ret = 0;
>> +
>> +	if (sctx->cur_ino == 0)
>> +		goto out;
>> +	if (!at_end && sctx->cur_ino == sctx->cmp_key->objectid &&
>> +	    sctx->cmp_key->type <= BTRFS_INODE_REF_KEY)
>> +		goto out;
>> +	if (list_empty(&sctx->new_refs) && list_empty(&sctx->deleted_refs))
>> +		goto out;
>> +
>> +	ret = process_recorded_refs(sctx);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int finish_inode_if_needed(struct send_ctx *sctx, int at_end)
>> +{
>> +	int ret = 0;
>> +	u64 left_mode;
>> +	u64 left_uid;
>> +	u64 left_gid;
>> +	u64 right_mode;
>> +	u64 right_uid;
>> +	u64 right_gid;
>> +	int need_chmod = 0;
>> +	int need_chown = 0;
>> +
>> +	ret = process_recorded_refs_if_needed(sctx, at_end);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	if (sctx->cur_ino == 0 || sctx->cur_inode_deleted)
>> +		goto out;
>> +	if (!at_end&&  sctx->cmp_key->objectid == sctx->cur_ino)
>> +		goto out;
>> +
>> +	ret = get_inode_info(sctx->send_root, sctx->cur_ino, NULL, NULL,
>> +			&left_mode,&left_uid,&left_gid);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +	if (!S_ISLNK(sctx->cur_inode_mode)) {
>> +		if (!sctx->parent_root || sctx->cur_inode_new) {
>> +			need_chmod = 1;
>> +			need_chown = 1;
>> +		} else {
>> +			ret = get_inode_info(sctx->parent_root, sctx->cur_ino,
>> +					NULL, NULL,&right_mode,&right_uid,
>> +					&right_gid);
>> +			if (ret<  0)
>> +				goto out;
>> +
>> +			if (left_uid != right_uid || left_gid != right_gid)
>> +				need_chown = 1;
>> +			if (left_mode != right_mode)
>> +				need_chmod = 1;
>> +		}
>> +	}
>> +
>> +	if (S_ISREG(sctx->cur_inode_mode)) {
>> +		ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen,
>> +				sctx->cur_inode_size);
>> +		if (ret<  0)
>> +			goto out;
>> +	}
>> +
>> +	if (need_chown) {
>> +		ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen,
>> +				left_uid, left_gid);
>> +		if (ret<  0)
>> +			goto out;
>> +	}
>> +	if (need_chmod) {
>> +		ret = send_chmod(sctx, sctx->cur_ino, sctx->cur_inode_gen,
>> +				left_mode);
>> +		if (ret<  0)
>> +			goto out;
>> +	}
>> +
>> +	/*
>> +	 * Need to send this every time, no matter if it actually changed
>> +	 * between the two trees, as we have made changes to the inode before.
>> +	 */
>> +	ret = send_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen);
>> +	if (ret<  0)
>> +		goto out;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int changed_inode(struct send_ctx *sctx,
>> +			 enum btrfs_compare_tree_result result)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_key *key = sctx->cmp_key;
>> +	struct btrfs_inode_item *left_ii = NULL;
>> +	struct btrfs_inode_item *right_ii = NULL;
>> +	u64 left_gen = 0;
>> +	u64 right_gen = 0;
>> +
>> +	ret = close_cur_inode_file(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	sctx->cur_ino = key->objectid;
>> +	sctx->cur_inode_new_gen = 0;
>> +	sctx->send_progress = sctx->cur_ino;
>> +
>> +	if (result == BTRFS_COMPARE_TREE_NEW ||
>> +	    result == BTRFS_COMPARE_TREE_CHANGED) {
>> +		left_ii = btrfs_item_ptr(sctx->left_path->nodes[0],
>> +				sctx->left_path->slots[0],
>> +				struct btrfs_inode_item);
>> +		left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
>> +				left_ii);
>> +	} else {
>> +		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
>> +				sctx->right_path->slots[0],
>> +				struct btrfs_inode_item);
>> +		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>> +				right_ii);
>> +	}
>> +	if (result == BTRFS_COMPARE_TREE_CHANGED) {
>> +		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
>> +				sctx->right_path->slots[0],
>> +				struct btrfs_inode_item);
>> +
>> +		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>> +				right_ii);
>> +		if (left_gen != right_gen)
>> +			sctx->cur_inode_new_gen = 1;
>> +	}
>> +
>> +	if (result == BTRFS_COMPARE_TREE_NEW) {
>> +		sctx->cur_inode_gen = left_gen;
>> +		sctx->cur_inode_new = 1;
>> +		sctx->cur_inode_deleted = 0;
>> +		sctx->cur_inode_size = btrfs_inode_size(
>> +				sctx->left_path->nodes[0], left_ii);
>> +		sctx->cur_inode_mode = btrfs_inode_mode(
>> +				sctx->left_path->nodes[0], left_ii);
>> +		if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
>> +			ret = send_create_inode(sctx, sctx->left_path,
>> +					sctx->cmp_key);
>> +	} else if (result == BTRFS_COMPARE_TREE_DELETED) {
>> +		sctx->cur_inode_gen = right_gen;
>> +		sctx->cur_inode_new = 0;
>> +		sctx->cur_inode_deleted = 1;
>> +		sctx->cur_inode_size = btrfs_inode_size(
>> +				sctx->right_path->nodes[0], right_ii);
>> +		sctx->cur_inode_mode = btrfs_inode_mode(
>> +				sctx->right_path->nodes[0], right_ii);
>> +	} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
>> +		if (sctx->cur_inode_new_gen) {
>> +			sctx->cur_inode_gen = right_gen;
>> +			sctx->cur_inode_new = 0;
>> +			sctx->cur_inode_deleted = 1;
>> +			sctx->cur_inode_size = btrfs_inode_size(
>> +					sctx->right_path->nodes[0], right_ii);
>> +			sctx->cur_inode_mode = btrfs_inode_mode(
>> +					sctx->right_path->nodes[0], right_ii);
>> +			ret = process_all_refs(sctx,
>> +					BTRFS_COMPARE_TREE_DELETED);
>> +			if (ret < 0)
>> +				goto out;
>> +
>> +			sctx->cur_inode_gen = left_gen;
>> +			sctx->cur_inode_new = 1;
>> +			sctx->cur_inode_deleted = 0;
>> +			sctx->cur_inode_size = btrfs_inode_size(
>> +					sctx->left_path->nodes[0], left_ii);
>> +			sctx->cur_inode_mode = btrfs_inode_mode(
>> +					sctx->left_path->nodes[0], left_ii);
>> +			ret = send_create_inode(sctx, sctx->left_path,
>> +					sctx->cmp_key);
>> +			if (ret < 0)
>> +				goto out;
>> +
>> +			ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW);
>> +			if (ret < 0)
>> +				goto out;
>> +			ret = process_all_extents(sctx);
>> +			if (ret < 0)
>> +				goto out;
>> +			ret = process_all_new_xattrs(sctx);
>> +			if (ret < 0)
>> +				goto out;
>> +		} else {
>> +			sctx->cur_inode_gen = left_gen;
>> +			sctx->cur_inode_new = 0;
>> +			sctx->cur_inode_new_gen = 0;
>> +			sctx->cur_inode_deleted = 0;
>> +			sctx->cur_inode_size = btrfs_inode_size(
>> +					sctx->left_path->nodes[0], left_ii);
>> +			sctx->cur_inode_mode = btrfs_inode_mode(
>> +					sctx->left_path->nodes[0], left_ii);
>> +		}
>> +	}
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int changed_ref(struct send_ctx *sctx,
>> +		       enum btrfs_compare_tree_result result)
>> +{
>> +	int ret = 0;
>> +
>> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
>> +
>> +	if (!sctx->cur_inode_new_gen &&
>> +	    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) {
>> +		if (result == BTRFS_COMPARE_TREE_NEW)
>> +			ret = record_new_ref(sctx);
>> +		else if (result == BTRFS_COMPARE_TREE_DELETED)
>> +			ret = record_deleted_ref(sctx);
>> +		else if (result == BTRFS_COMPARE_TREE_CHANGED)
>> +			ret = record_changed_ref(sctx);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int changed_xattr(struct send_ctx *sctx,
>> +			 enum btrfs_compare_tree_result result)
>> +{
>> +	int ret = 0;
>> +
>> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
>> +
>> +	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
>> +		if (result == BTRFS_COMPARE_TREE_NEW)
>> +			ret = process_new_xattr(sctx);
>> +		else if (result == BTRFS_COMPARE_TREE_DELETED)
>> +			ret = process_deleted_xattr(sctx);
>> +		else if (result == BTRFS_COMPARE_TREE_CHANGED)
>> +			ret = process_changed_xattr(sctx);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int changed_extent(struct send_ctx *sctx,
>> +			  enum btrfs_compare_tree_result result)
>> +{
>> +	int ret = 0;
>> +
>> +	BUG_ON(sctx->cur_ino != sctx->cmp_key->objectid);
>> +
>> +	if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) {
>> +		if (result != BTRFS_COMPARE_TREE_DELETED)
>> +			ret = process_extent(sctx, sctx->left_path,
>> +					sctx->cmp_key);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +
>> +static int changed_cb(struct btrfs_root *left_root,
>> +		      struct btrfs_root *right_root,
>> +		      struct btrfs_path *left_path,
>> +		      struct btrfs_path *right_path,
>> +		      struct btrfs_key *key,
>> +		      enum btrfs_compare_tree_result result,
>> +		      void *ctx)
>> +{
>> +	int ret = 0;
>> +	struct send_ctx *sctx = ctx;
>> +
>> +	sctx->left_path = left_path;
>> +	sctx->right_path = right_path;
>> +	sctx->cmp_key = key;
>> +
>> +	ret = finish_inode_if_needed(sctx, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (key->type == BTRFS_INODE_ITEM_KEY)
>> +		ret = changed_inode(sctx, result);
>> +	else if (key->type == BTRFS_INODE_REF_KEY)
>> +		ret = changed_ref(sctx, result);
>> +	else if (key->type == BTRFS_XATTR_ITEM_KEY)
>> +		ret = changed_xattr(sctx, result);
>> +	else if (key->type == BTRFS_EXTENT_DATA_KEY)
>> +		ret = changed_extent(sctx, result);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int full_send_tree(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +	struct btrfs_trans_handle *trans = NULL;
>> +	struct btrfs_root *send_root = sctx->send_root;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_path *path;
>> +	struct extent_buffer *eb;
>> +	int slot;
>> +	u64 start_ctransid;
>> +	u64 ctransid;
>> +
>> +	path = alloc_path_for_send();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	spin_lock(&send_root->root_times_lock);
>> +	start_ctransid = btrfs_root_ctransid(&send_root->root_item);
>> +	spin_unlock(&send_root->root_times_lock);
>> +
>> +	key.objectid = BTRFS_FIRST_FREE_OBJECTID;
>> +	key.type = BTRFS_INODE_ITEM_KEY;
>> +	key.offset = 0;
>> +
>> +join_trans:
>> +	/*
>> +	 * We need to make sure the transaction does not get committed
>> +	 * while we do anything on commit roots. Join a transaction to prevent
>> +	 * this.
>> +	 */
>> +	trans = btrfs_join_transaction(send_root);
>> +	if (IS_ERR(trans)) {
>> +		ret = PTR_ERR(trans);
>> +		trans = NULL;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * Make sure the tree has not changed
>> +	 */
>> +	spin_lock(&send_root->root_times_lock);
>> +	ctransid = btrfs_root_ctransid(&send_root->root_item);
>> +	spin_unlock(&send_root->root_times_lock);
>> +
>> +	if (ctransid != start_ctransid) {
>> +		WARN(1, KERN_WARNING "btrfs: the root that you're trying to "
>> +				     "send was modified in between. This is "
>> +				     "probably a bug.\n");
>
> What is the purpose of getting the ctransid outside the
> transaction anyway?
>
Hmm I don't understand the question...
>> +		ret = -EIO;
>> +		goto out;
>> +	}
>> +
>> +	ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0);
>> +	if (ret < 0)
>> +		goto out;
>> +	if (ret)
>> +		goto out_finish;
>> +
>> +	while (1) {
>> +		/*
>> +		 * When someone wants to commit while we iterate, end the
>> +		 * joined transaction and rejoin.
>> +		 */
>> +		if (btrfs_should_end_transaction(trans, send_root)) {
>> +			ret = btrfs_end_transaction(trans, send_root);
>> +			trans = NULL;
>> +			if (ret < 0)
>> +				goto out;
>> +			btrfs_release_path(path);
>> +			goto join_trans;
>> +		}
>> +
>> +		eb = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(eb, &found_key, slot);
>> +
>> +		ret = changed_cb(send_root, NULL, path, NULL,
>> +				&found_key, BTRFS_COMPARE_TREE_NEW, sctx);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		key.objectid = found_key.objectid;
>> +		key.type = found_key.type;
>> +		key.offset = found_key.offset + 1;
>
> shouldn't this just be before the goto join_trans?
>
Hmm I don't think so. Am I missing something?
>> +
>> +		ret = btrfs_next_item(send_root, path);
>> +		if (ret < 0)
>> +			goto out;
>> +		if (ret) {
>> +			ret = 0;
>> +			break;
>> +		}
>> +	}
>> +
>> +out_finish:
>> +	ret = finish_inode_if_needed(sctx, 1);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	if (trans) {
>> +		if (!ret)
>> +			ret = btrfs_end_transaction(trans, send_root);
>> +		else
>> +			btrfs_end_transaction(trans, send_root);
>> +	}
>> +	return ret;
>> +}
>> +
>> +static int send_subvol(struct send_ctx *sctx)
>> +{
>> +	int ret;
>> +
>> +	ret = send_header(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = send_subvol_begin(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	if (sctx->parent_root) {
>> +		ret = btrfs_compare_trees(sctx->send_root, sctx->parent_root,
>> +				changed_cb, sctx);
>> +		if (ret < 0)
>> +			goto out;
>> +		ret = finish_inode_if_needed(sctx, 1);
>> +		if (ret < 0)
>> +			goto out;
>> +	} else {
>> +		ret = full_send_tree(sctx);
>> +		if (ret < 0)
>> +			goto out;
>> +	}
>> +
>> +out:
>> +	if (!ret)
>> +		ret = close_cur_inode_file(sctx);
>> +	else
>> +		close_cur_inode_file(sctx);
>> +
>> +	free_recorded_refs(sctx);
>> +	return ret;
>> +}
>> +
>> +long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_root *send_root;
>> +	struct btrfs_root *clone_root;
>> +	struct btrfs_fs_info *fs_info;
>> +	struct btrfs_ioctl_send_args *arg = NULL;
>> +	struct btrfs_key key;
>> +	struct file *filp = NULL;
>> +	struct send_ctx *sctx = NULL;
>> +	u32 i;
>> +	u64 *clone_sources_tmp = NULL;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	send_root = BTRFS_I(fdentry(mnt_file)->d_inode)->root;
>> +	fs_info = send_root->fs_info;
>> +
>> +	arg = memdup_user(arg_, sizeof(*arg));
>> +	if (IS_ERR(arg)) {
>> +		ret = PTR_ERR(arg);
>> +		arg = NULL;
>> +		goto out;
>> +	}
>> +
>> +	if (!access_ok(VERIFY_READ, arg->clone_sources,
>> +			sizeof(*arg->clone_sources) *
>> +			arg->clone_sources_count)) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	sctx = kzalloc(sizeof(struct send_ctx), GFP_NOFS);
>> +	if (!sctx) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	INIT_LIST_HEAD(&sctx->new_refs);
>> +	INIT_LIST_HEAD(&sctx->deleted_refs);
>> +	INIT_RADIX_TREE(&sctx->name_cache, GFP_NOFS);
>> +	INIT_LIST_HEAD(&sctx->name_cache_list);
>> +
>> +	sctx->send_filp = fget(arg->send_fd);
>> +	if (!sctx->send_filp) {
>> +		/* fget() returns NULL on failure, not an ERR_PTR */
>> +		ret = -EBADF;
>> +		goto out;
>> +	}
>> +
>> +	sctx->mnt = mnt_file->f_path.mnt;
>> +
>> +	sctx->send_root = send_root;
>> +	sctx->clone_roots_cnt = arg->clone_sources_count;
>> +
>> +	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
>> +	sctx->send_buf = vmalloc(sctx->send_max_size);
>> +	if (!sctx->send_buf) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
>> +	if (!sctx->read_buf) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	sctx->clone_roots = vzalloc(sizeof(struct clone_root) *
>> +			(arg->clone_sources_count + 1));
>> +	if (!sctx->clone_roots) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	if (arg->clone_sources_count) {
>> +		clone_sources_tmp = vmalloc(arg->clone_sources_count *
>> +				sizeof(*arg->clone_sources));
>> +		if (!clone_sources_tmp) {
>> +			ret = -ENOMEM;
>> +			goto out;
>> +		}
>> +
>> +		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
>> +				arg->clone_sources_count *
>> +				sizeof(*arg->clone_sources));
>> +		if (ret) {
>> +			ret = -EFAULT;
>> +			goto out;
>> +		}
>> +
>> +		for (i = 0; i < arg->clone_sources_count; i++) {
>> +			key.objectid = clone_sources_tmp[i];
>> +			key.type = BTRFS_ROOT_ITEM_KEY;
>> +			key.offset = (u64)-1;
>> +			clone_root = btrfs_read_fs_root_no_name(fs_info, &key);
>> +			if (!clone_root) {
>> +				ret = -EINVAL;
>> +				goto out;
>> +			}
>> +			if (IS_ERR(clone_root)) {
>> +				ret = PTR_ERR(clone_root);
>> +				goto out;
>> +			}
>> +			sctx->clone_roots[i].root = clone_root;
>> +		}
>> +		vfree(clone_sources_tmp);
>> +		clone_sources_tmp = NULL;
>> +	}
>> +
>> +	if (arg->parent_root) {
>> +		key.objectid = arg->parent_root;
>> +		key.type = BTRFS_ROOT_ITEM_KEY;
>> +		key.offset = (u64)-1;
>> +		sctx->parent_root = btrfs_read_fs_root_no_name(fs_info, &key);
>> +		if (!sctx->parent_root) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Clones from send_root are allowed, but only if the clone source
>> +	 * is behind the current send position. This is checked while searching
>> +	 * for possible clone sources.
>> +	 */
>> +	sctx->clone_roots[sctx->clone_roots_cnt++].root = sctx->send_root;
>> +
>> +	/* We do a bsearch later */
>> +	sort(sctx->clone_roots, sctx->clone_roots_cnt,
>> +			sizeof(*sctx->clone_roots), __clone_root_cmp_sort,
>> +			NULL);
>> +
>> +	ret = send_subvol(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	ret = begin_cmd(sctx, BTRFS_SEND_C_END);
>> +	if (ret < 0)
>> +		goto out;
>> +	ret = send_cmd(sctx);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +out:
>> +	if (filp)
>> +		fput(filp);
>> +	kfree(arg);
>> +	vfree(clone_sources_tmp);
>> +
>> +	if (sctx) {
>> +		if (sctx->send_filp)
>> +			fput(sctx->send_filp);
>> +
>> +		vfree(sctx->clone_roots);
>> +		vfree(sctx->send_buf);
>> +		vfree(sctx->read_buf);
>> +
>> +		name_cache_free(sctx);
>> +
>> +		kfree(sctx);
>> +	}
>> +
>> +	return ret;
>> +}
>> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
>> index a4c23ee..53f8ee7 100644
>> --- a/fs/btrfs/send.h
>> +++ b/fs/btrfs/send.h
>> @@ -124,3 +124,7 @@ enum {
>>   	__BTRFS_SEND_A_MAX,
>>   };
>>   #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
>> +
>> +#ifdef __KERNEL__
>> +long btrfs_ioctl_send(struct file *mnt_file, void __user *arg);
>> +#endif
>



* Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
  2012-07-23 15:17   ` Alex Lyakas
@ 2012-08-01 12:54     ` Alexander Block
  0 siblings, 0 replies; 43+ messages in thread
From: Alexander Block @ 2012-08-01 12:54 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On Mon, Jul 23, 2012 at 5:17 PM, Alex Lyakas
<alex.bolshoy.btrfs@gmail.com> wrote:
> Hi Alexander,
> I did some testing of the case where same inode, but with a different
> generation, exists both in send_root and in parent_root.
> I know that this can happen primarily when "inode_cache" option is
> enabled. So first I just tested some differential sends, where parent
> and root are unrelated subvolumes. Here are some issues:
>
> 1) The top subvolume inode (ino=BTRFS_FIRST_FREE_OBJECTID) is treated
> also as deleted + recreated. So the code goes into process_all_refs()
> path and does several strange things, such as trying to orphanize the
> top inode. Also get_cur_path() always returns "" for the top subvolume
> (without checking whether it is an orphan).  Another complication for
> the top inode is that its parent dir is itself.
> I made the following fix:
> @@ -3782,7 +3972,13 @@ static int changed_inode(struct send_ctx *sctx,
>
>                 right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>                                 right_ii);
> -               if (left_gen != right_gen)
> +               if (left_gen != right_gen &&
> +                   sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
>                         sctx->cur_inode_new_gen = 1;
>
> So basically, don't try to delete and re-create it, but treat it like
> a change. Since the top subvolume inode is S_IFDIR, and dir can have
> only one hardlink (and hopefully it is always ".."), we will never
> need to change anything for this INODE_REF. I also added:
>
> @@ -2526,6 +2615,14 @@ static int process_recorded_refs(struct send_ctx *sctx)
>         int did_overwrite = 0;
>         int is_orphan = 0;
>
> +       BUG_ON(sctx->cur_ino <= BTRFS_FIRST_FREE_OBJECTID);
I applied both fixes to for-chris now.
>
> 2) After I fixed this, I hit another issue, where inodes under the top
> subvolume dir, attempt to rmdir() the top dir, while iterating over
> check_dirs in process_recorded_refs(), because (top_dir_ino,
> top_dir_gen) indicate that it was deleted. So I added:
>
> @@ -2714,10 +2857,19 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", sctx->cur_ino);
>          */
>         ULIST_ITER_INIT(&uit);
>         while ((un = ulist_next(check_dirs, &uit))) {
> +               /* Do not attempt to rmdir the top subvolume dir */
> +               if (un->val == BTRFS_FIRST_FREE_OBJECTID)
> +                       continue;
> +
>                 if (un->val > sctx->cur_ino)
>                         continue;
I applied a similar fix by adding a check to can_rmdir. The way you
suggested would also skip utime updates for the top dir.
>
> 3) process_recorded_refs() always increments the send_progress:
>         /*
>          * Current inode is now at its new position, so we must increase
>          * send_progress
>          */
>         sctx->send_progress = sctx->cur_ino + 1;
>
> However, in the changed_inode() path I am testing, process_all_refs()
> is called twice with the same inode (once for deleted inode, once for
> the recreated inode), so after the first call, send_progress is
> incremented and doesn't match the inode anymore. I don't think I hit
> any issues because of this, just that it's confusing.
I fixed this issue some days ago.
>
> 4)
>
>> +/*
>> + * Record and process all refs at once. Needed when an inode changes the
>> + * generation number, which means that it was deleted and recreated.
>> + */
>> +static int process_all_refs(struct send_ctx *sctx,
>> +                           enum btrfs_compare_tree_result cmd)
>> +{
>> +       int ret;
>> +       struct btrfs_root *root;
>> +       struct btrfs_path *path;
>> +       struct btrfs_key key;
>> +       struct btrfs_key found_key;
>> +       struct extent_buffer *eb;
>> +       int slot;
>> +       iterate_inode_ref_t cb;
>> +
>> +       path = alloc_path_for_send();
>> +       if (!path)
>> +               return -ENOMEM;
>> +
>> +       if (cmd == BTRFS_COMPARE_TREE_NEW) {
>> +               root = sctx->send_root;
>> +               cb = __record_new_ref;
>> +       } else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
>> +               root = sctx->parent_root;
>> +               cb = __record_deleted_ref;
>> +       } else {
>> +               BUG();
>> +       }
>> +
>> +       key.objectid = sctx->cmp_key->objectid;
>> +       key.type = BTRFS_INODE_REF_KEY;
>> +       key.offset = 0;
>> +       while (1) {
>> +               ret = btrfs_search_slot_for_read(root, &key, path, 1, 0);
>> +               if (ret < 0) {
>> +                       btrfs_release_path(path);
>> +                       goto out;
>> +               }
>> +               if (ret) {
>> +                       btrfs_release_path(path);
>> +                       break;
>> +               }
>> +
>> +               eb = path->nodes[0];
>> +               slot = path->slots[0];
>> +               btrfs_item_key_to_cpu(eb, &found_key, slot);
>> +
>> +               if (found_key.objectid != key.objectid ||
>> +                   found_key.type != key.type) {
>> +                       btrfs_release_path(path);
>> +                       break;
>> +               }
>> +
>> +               ret = iterate_inode_ref(sctx, sctx->parent_root, path,
>> +                               &found_key, 0, cb, sctx);
>
> Shouldn't it be the root that you calculated earlier and not
> sctx->parent_root? I guess in this case it doesn't matter, because
> "resolve" is 0, and the passed root is only used for resolve. But
> still confusing.
You're right, atm it does not matter which root we use here. It is
more correct to pass 'root' instead of parent_root.
>
> 5) When I started testing with "inode_cache" enabled, I hit another
> issue. When this mount option is enabled, then FREE_INO and FREE_SPACE
> items now appear in the file tree. As a result, the code tries to
> create the FREE_INO item with an orphan name, then tries to find its
> INODE_REF, but fails because it has no INODE_REFs. So
>
> @@ -3923,6 +4127,13 @@ static int changed_cb(struct btrfs_root *left_root,
>         int ret = 0;
>         struct send_ctx *sctx = ctx;
>
> +       /* Ignore non-FS objects */
> +       if (key->objectid == BTRFS_FREE_INO_OBJECTID ||
> +               key->objectid == BTRFS_FREE_SPACE_OBJECTID)
> +               return 0;
>
> makes sense?
Yepp. I however added it after the finish_inode_if_needed call. The
call is still required to finish the previous inode.
>
> Thanks,
> Alex.

Thanks again. Pushed to for-chris.


end of thread, other threads:[~2012-08-01 12:54 UTC | newest]

Thread overview: 43+ messages
2012-07-04 13:38 [RFC PATCH 0/7] Experimental btrfs send/receive (kernel side) Alexander Block
2012-07-04 13:38 ` [RFC PATCH 1/7] Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS Alexander Block
2012-07-04 13:38 ` [RFC PATCH 2/7] Btrfs: add helper for tree enumeration Alexander Block
2012-07-04 13:38 ` [RFC PATCH 3/7] Btrfs: make iref_to_path non static Alexander Block
2012-07-04 13:38 ` [RFC PATCH 4/7] Btrfs: introduce subvol uuids and times Alexander Block
2012-07-05 11:51   ` Alexander Block
2012-07-05 17:08   ` Zach Brown
2012-07-05 17:14     ` Alexander Block
2012-07-05 17:20       ` Zach Brown
2012-07-05 18:33         ` Ilya Dryomov
2012-07-05 18:37           ` Zach Brown
2012-07-05 18:59             ` Ilya Dryomov
2012-07-05 19:01               ` Zach Brown
2012-07-05 19:18                 ` Alexander Block
2012-07-05 19:24                   ` Ilya Dryomov
2012-07-05 19:43                     ` Alexander Block
2012-07-16 14:56   ` Arne Jansen
2012-07-23 19:41     ` Alexander Block
2012-07-24  5:55       ` Arne Jansen
2012-07-25 10:51         ` Alexander Block
2012-07-04 13:38 ` [RFC PATCH 5/7] Btrfs: add btrfs_compare_trees function Alexander Block
2012-07-04 18:27   ` Alex Lyakas
2012-07-04 19:49     ` Alexander Block
2012-07-04 19:13   ` Alex Lyakas
2012-07-04 20:18     ` Alexander Block
2012-07-04 23:31       ` David Sterba
2012-07-05 12:19       ` Alex Lyakas
2012-07-05 12:47         ` Alexander Block
2012-07-05 13:04           ` Alex Lyakas
2012-07-04 13:38 ` [RFC PATCH 6/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 1) Alexander Block
2012-07-18  6:59   ` Arne Jansen
2012-07-25 17:33     ` Alexander Block
2012-07-21 10:53   ` Arne Jansen
2012-07-04 13:38 ` [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2) Alexander Block
2012-07-10 15:26   ` Alex Lyakas
2012-07-25 13:37     ` Alexander Block
2012-07-25 17:20       ` Alex Lyakas
2012-07-25 17:41         ` Alexander Block
2012-07-23 11:16   ` Arne Jansen
2012-07-23 15:28     ` Alex Lyakas
2012-07-28 13:49     ` Alexander Block
2012-07-23 15:17   ` Alex Lyakas
2012-08-01 12:54     ` Alexander Block
