All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
@ 2015-01-13 16:27 Filipe Manana
  2015-01-14  1:28 ` [PATCH v2] " Filipe Manana
  2015-01-14  1:52 ` [PATCH v3] " Filipe Manana
  0 siblings, 2 replies; 3+ messages in thread
From: Filipe Manana @ 2015-01-13 16:27 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

So on overflow error when overwriting a reference item (regular or extend
reference item), delete the old and replace it with the one in the fsync
log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index ecf462a..a1ce105 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1245,6 +1245,28 @@ static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
 
 	/* finally write the back reference in the inode */
 	ret = overwrite_item(trans, root, path, eb, slot, key);
+	if (ret == -EOVERFLOW) {
+		/*
+		 * This means we have a reference item in the fs/subvol tree
+		 * that groups multiple references, some of which were added
+		 * by the above loop, some are current and some are obsolete
+		 * and are going to be deleted by a future stage of the fsync
+		 * log replay code. So just delete the item and copy the
+		 * one from the log tree into the fs/subvol tree - this is
+		 * safe and later if a link count in the inode is incorrect,
+		 * it will be corrected by our log replay code.
+		 */
+		ret = btrfs_search_slot(trans, root, key, path, -1, 1);
+		if (WARN_ON(ret == 1))
+			ret = -EIO;
+		if (ret < 0)
+			goto out;
+		ret = btrfs_del_item(trans, root, path);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+		ret = overwrite_item(trans, root, path, eb, slot, key);
+	}
 out:
 	btrfs_release_path(path);
 	kfree(name);
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH v2] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
  2015-01-13 16:27 [PATCH] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs Filipe Manana
@ 2015-01-14  1:28 ` Filipe Manana
  2015-01-14  1:52 ` [PATCH v3] " Filipe Manana
  1 sibling, 0 replies; 3+ messages in thread
From: Filipe Manana @ 2015-01-14  1:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Deal with another case (and updated test case) where the replay
    logic inserted the same dentry both as a normal ref and as an extref,
    resulting in similar directory corruption.

 fs/btrfs/inode-item.c |  9 +++++++--
 fs/btrfs/tree-log.c   | 44 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index 8ffa478..265e03c 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -344,6 +344,7 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	path->leave_spinning = 1;
+	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
@@ -362,8 +363,12 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 		ptr = (unsigned long)(ref + 1);
 		ret = 0;
 	} else if (ret < 0) {
-		if (ret == -EOVERFLOW)
-			ret = -EMLINK;
+		if (ret == -EOVERFLOW) {
+			if (find_name_in_backref(path, name, name_len, &ref))
+				ret = -EEXIST;
+			else
+				ret = -EMLINK;
+		}
 		goto out;
 	} else {
 		ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index b44880f..fce782d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -453,8 +453,10 @@ static noinline int overwrite_item(struct btrfs_trans_handle *trans,
 insert:
 	btrfs_release_path(path);
 	/* try to insert the key into the destination tree */
+	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path,
 				      key, item_size);
+	path->skip_release_on_error = 0;
 
 	/* make sure any existing item is the correct size */
 	if (ret == -EEXIST) {
@@ -466,6 +468,13 @@ insert:
 		else if (found_size < item_size)
 			btrfs_extend_item(root, path,
 					  item_size - found_size);
+	} else if (ret == -EOVERFLOW) {
+		u32 found_size;
+
+		found_size = btrfs_item_size_nr(path->nodes[0],
+						path->slots[0]);
+		ASSERT(item_size > found_size);
+		btrfs_extend_item(root, path, item_size - found_size);
 	} else if (ret) {
 		return ret;
 	}
@@ -844,7 +853,7 @@ out:
 static noinline int backref_in_log(struct btrfs_root *log,
 				   struct btrfs_key *key,
 				   u64 ref_objectid,
-				   char *name, int namelen)
+				   const char *name, int namelen)
 {
 	struct btrfs_path *path;
 	struct btrfs_inode_ref *ref;
@@ -1555,6 +1564,30 @@ static noinline int insert_one_name(struct btrfs_trans_handle *trans,
 }
 
 /*
+ * Return true if an inode reference exists in the log for the given name,
+ * inode and parent inode.
+ */
+static bool name_in_log_ref(struct btrfs_root *log_root,
+			    const char *name, const int name_len,
+			    const u64 dirid, const u64 ino)
+{
+	struct btrfs_key search_key;
+
+	search_key.objectid = ino;
+	search_key.type = BTRFS_INODE_REF_KEY;
+	search_key.offset = dirid;
+	if (backref_in_log(log_root, &search_key, dirid, name, name_len))
+		return true;
+
+	search_key.type = BTRFS_INODE_EXTREF_KEY;
+	search_key.offset = btrfs_extref_hash(dirid, name, name_len);
+	if (backref_in_log(log_root, &search_key, dirid, name, name_len))
+		return true;
+
+	return false;
+}
+
+/*
  * take a single entry in a log directory item and replay it into
  * the subvolume.
  *
@@ -1664,10 +1697,17 @@ out:
 	return ret;
 
 insert:
+	if (name_in_log_ref(root->log_root, name, name_len,
+			    key->objectid, log_key.objectid)) {
+		/* The dentry will be added later. */
+		ret = 0;
+		update_size = false;
+		goto out;
+	}
 	btrfs_release_path(path);
 	ret = insert_one_name(trans, root, path, key->objectid, key->offset,
 			      name, name_len, log_type, &log_key);
-	if (ret && ret != -ENOENT)
+	if (ret && ret != -ENOENT && ret != -EEXIST)
 		goto out;
 	update_size = false;
 	ret = 0;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH v3] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
  2015-01-13 16:27 [PATCH] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs Filipe Manana
  2015-01-14  1:28 ` [PATCH v2] " Filipe Manana
@ 2015-01-14  1:52 ` Filipe Manana
  1 sibling, 0 replies; 3+ messages in thread
From: Filipe Manana @ 2015-01-14  1:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Deal with another case (and updated test case) where the replay
    logic inserted the same dentry both as a normal ref and as an extref,
    resulting in similar directory corruption.
V3: Unified eexist and eoverflow logic in overwrite_item().

 fs/btrfs/inode-item.c |  9 +++++++--
 fs/btrfs/tree-log.c   | 39 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index 8ffa478..265e03c 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -344,6 +344,7 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	path->leave_spinning = 1;
+	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
@@ -362,8 +363,12 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 		ptr = (unsigned long)(ref + 1);
 		ret = 0;
 	} else if (ret < 0) {
-		if (ret == -EOVERFLOW)
-			ret = -EMLINK;
+		if (ret == -EOVERFLOW) {
+			if (find_name_in_backref(path, name, name_len, &ref))
+				ret = -EEXIST;
+			else
+				ret = -EMLINK;
+		}
 		goto out;
 	} else {
 		ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index b44880f..920fe42 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -453,11 +453,13 @@ static noinline int overwrite_item(struct btrfs_trans_handle *trans,
 insert:
 	btrfs_release_path(path);
 	/* try to insert the key into the destination tree */
+	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path,
 				      key, item_size);
+	path->skip_release_on_error = 0;
 
 	/* make sure any existing item is the correct size */
-	if (ret == -EEXIST) {
+	if (ret == -EEXIST || ret == -EOVERFLOW) {
 		u32 found_size;
 		found_size = btrfs_item_size_nr(path->nodes[0],
 						path->slots[0]);
@@ -844,7 +846,7 @@ out:
 static noinline int backref_in_log(struct btrfs_root *log,
 				   struct btrfs_key *key,
 				   u64 ref_objectid,
-				   char *name, int namelen)
+				   const char *name, int namelen)
 {
 	struct btrfs_path *path;
 	struct btrfs_inode_ref *ref;
@@ -1555,6 +1557,30 @@ static noinline int insert_one_name(struct btrfs_trans_handle *trans,
 }
 
 /*
+ * Return true if an inode reference exists in the log for the given name,
+ * inode and parent inode.
+ */
+static bool name_in_log_ref(struct btrfs_root *log_root,
+			    const char *name, const int name_len,
+			    const u64 dirid, const u64 ino)
+{
+	struct btrfs_key search_key;
+
+	search_key.objectid = ino;
+	search_key.type = BTRFS_INODE_REF_KEY;
+	search_key.offset = dirid;
+	if (backref_in_log(log_root, &search_key, dirid, name, name_len))
+		return true;
+
+	search_key.type = BTRFS_INODE_EXTREF_KEY;
+	search_key.offset = btrfs_extref_hash(dirid, name, name_len);
+	if (backref_in_log(log_root, &search_key, dirid, name, name_len))
+		return true;
+
+	return false;
+}
+
+/*
  * take a single entry in a log directory item and replay it into
  * the subvolume.
  *
@@ -1664,10 +1690,17 @@ out:
 	return ret;
 
 insert:
+	if (name_in_log_ref(root->log_root, name, name_len,
+			    key->objectid, log_key.objectid)) {
+		/* The dentry will be added later. */
+		ret = 0;
+		update_size = false;
+		goto out;
+	}
 	btrfs_release_path(path);
 	ret = insert_one_name(trans, root, path, key->objectid, key->offset,
 			      name, name_len, log_type, &log_key);
-	if (ret && ret != -ENOENT)
+	if (ret && ret != -ENOENT && ret != -EEXIST)
 		goto out;
 	update_size = false;
 	ret = 0;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-01-14  1:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-13 16:27 [PATCH] Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs Filipe Manana
2015-01-14  1:28 ` [PATCH v2] " Filipe Manana
2015-01-14  1:52 ` [PATCH v3] " Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.