* [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction
@ 2011-08-06  9:37 Liu Bo
  2011-08-06  9:37 ` [PATCH 01/12 v5] Revert "Btrfs: do not flush csum items of unchanged file data during treelog" Liu Bo
                   ` (12 more replies)
  0 siblings, 13 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

I've fixed a bug and rebased this onto the latest for-linus branch.
With my previously posted patch applied:

	[PATCH] Btrfs: fix an oops of log replay

I also tested this sub transaction patchset with
a) the sysbench 0.4.12 tool and
b) Chris's synctest tool, in both the _crash_ and _uncrash_ cases, and it works well.

Please test this and feel free to let me know if there are any problems.
I hope it gets through with no bugs and is ready for merge this time :)

===
I've been working on improving the write-ahead log's performance,
and I found that the bottleneck lies in the checksum items,
especially when we make random writes to a large file, e.g. a 4G file.

An idea for this, suggested by Chris, is to use sub transaction ids and to log
only the part of the inode that has changed since either the last log commit or
the last transaction commit.  And as we also push the sub transid into the btree
blocks, we get much faster tree walks.  As a result, we abandon the original
brute force approach, which is "to delete all items of the inode in the log"
to make sure we get the most up-to-date copies of everything, and instead
we "find and merge", i.e. find extents in the log tree and merge
in the new extents from the file.
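
To illustrate why pushing the sub transid into the btree blocks speeds up tree
walks, here is a minimal userspace sketch.  It is not the btrfs code: the
struct and function names (model_node, model_walk, model_walk_changed,
model_demo) are hypothetical, and it only assumes that every block records the
newest sub transid that touched its subtree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tree node; not the btrfs on-disk format. */
struct model_node {
	uint64_t generation;          /* newest sub transid that touched this subtree */
	size_t nr_children;           /* 0 for a leaf */
	struct model_node **children;
};

/* Counters filled in by the walk. */
struct model_walk {
	size_t found;                 /* leaves changed at or after min_gen */
	size_t visited;               /* blocks we actually had to look at */
};

/*
 * Visit only subtrees changed at or after min_gen.  A subtree with an
 * older generation cannot contain newer items, so it is skipped whole.
 */
static void model_walk_changed(const struct model_node *n, uint64_t min_gen,
			       struct model_walk *w)
{
	size_t i;

	assert(w != NULL);
	if (n == NULL || n->generation < min_gen)
		return;

	w->visited++;
	if (n->nr_children == 0) {
		w->found++;
		return;
	}
	for (i = 0; i < n->nr_children; i++)
		model_walk_changed(n->children[i], min_gen, w);
}

/* Example: a root (gen 7) over one stale leaf (gen 3) and one fresh leaf (gen 7). */
static struct model_walk model_demo(void)
{
	struct model_node stale = { .generation = 3 };
	struct model_node fresh = { .generation = 7 };
	struct model_node *kids[] = { &stale, &fresh };
	struct model_node root = { .generation = 7, .nr_children = 2,
				   .children = kids };
	struct model_walk w = { 0, 0 };

	model_walk_changed(&root, 5, &w);
	return w;
}
```

In the example, asking for changes since generation 5 looks at the root and the
fresh leaf only; the stale leaf is never visited.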

This patchset puts the above idea into code, and although the code is now more
complex, it brings us a great deal of performance improvement:

in my sysbench "write + fsync" test:

        451.01Kb/sec -> 4.3621Mb/sec

In v2, thanks to Chris, we worked together to solve 2 bugs, and after that it
works as expected.
In v3, thanks to Josef, we simplified some of the code.
In v4, I rebased to the latest for-linus branch; Chris hit two problems, and we
solved them.

Since there were some vital changes in the recent rc, like "kill trans_mutex" and
"use cur_trans", I rebased the patchset onto the latest for-linus branch, as
David asked.

More tests are welcome!


Liu Bo (12):
  Revert "Btrfs: do not flush csum items of unchanged file data during
    treelog"
  Btrfs: introduce sub transaction stuff
  Btrfs: update block generation if should_cow_block fails
  Btrfs: modify btrfs_drop_extents API
  Btrfs: introduce first sub trans
  Btrfs: still update inode trans stuff when size remains unchanged
  Btrfs: improve log with sub transaction
  Btrfs: add checksum check for log
  Btrfs: fix a bug of log check
  Btrfs: kick off useless code
  Btrfs: do not iput inode when inode is still in log
  Btrfs: use the right generation number to read log_root_tree

 fs/btrfs/btrfs_inode.h |   12 ++-
 fs/btrfs/ctree.c       |   87 +++++++++++++------
 fs/btrfs/ctree.h       |    5 +-
 fs/btrfs/disk-io.c     |   23 ++++--
 fs/btrfs/extent-tree.c |   10 ++-
 fs/btrfs/file.c        |   22 ++---
 fs/btrfs/inode.c       |   39 ++++++---
 fs/btrfs/ioctl.c       |    6 +-
 fs/btrfs/relocation.c  |    6 +-
 fs/btrfs/transaction.c |   13 ++-
 fs/btrfs/transaction.h |   19 ++++-
 fs/btrfs/tree-defrag.c |    2 +-
 fs/btrfs/tree-log.c    |  225 ++++++++++++++++++++++++++++++++----------------
 13 files changed, 312 insertions(+), 157 deletions(-)



* [PATCH 01/12 v5] Revert "Btrfs: do not flush csum items of unchanged file data during treelog"
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff Liu Bo
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

This reverts commit 8e531cdfeb75269c6c5aae33651cca39707848da.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/tree-log.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index babee65..f320641 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2689,9 +2689,6 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 			extent = btrfs_item_ptr(src, start_slot + i,
 						struct btrfs_file_extent_item);
 
-			if (btrfs_file_extent_generation(src, extent) < trans->transid)
-				continue;
-
 			found_type = btrfs_file_extent_type(src, extent);
 			if (found_type == BTRFS_FILE_EXTENT_REG ||
 			    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
-- 
1.6.5.2



* [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
  2011-08-06  9:37 ` [PATCH 01/12 v5] Revert "Btrfs: do not flush csum items of unchanged file data during treelog" Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-09 17:25   ` Mitch Harder
  2011-08-06  9:37 ` [PATCH 03/12 v5] Btrfs: update block generation if should_cow_block fails Liu Bo
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

Introduce a new concept, the "sub transaction".
The relation between a transaction and its sub transactions is:

transaction A       ---> transid = x
   sub trans a(1)   ---> sub_transid = x+1
   sub trans a(2)   ---> sub_transid = x+2
     ... ...
   sub trans a(n-1) ---> sub_transid = x+n-1
   sub trans a(n)   ---> sub_transid = x+n
transaction B       ---> transid = x+n+1
     ... ...

The most important points are:
a) a trans handle's transid now takes its value from the sub transid instead of
   the transid.
b) when a transaction commits, the next transid is not necessarily incremented
   by 1; it depends on the largest sub_transid of the preceding transaction,
   i.e.
        B->transid = a(n)->sub_transid + 1,
        (B->transid - A->transid) >= 1
c) we start a new sub transaction after an fsync.
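
The numbering above can be sketched with a small userspace model.  This is only
an illustration, not the kernel code: struct model_fs and the model_* functions
are hypothetical names, and the counter updates mirror what this patch does to
fs_info->generation and fs_info->sub_generation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the two counters; not struct btrfs_fs_info. */
struct model_fs {
	uint64_t generation;      /* transid of the running transaction */
	uint64_t sub_generation;  /* newest sub transid handed out */
};

/* Start a new transaction; its first sub transid equals its transid. */
static uint64_t model_start_transaction(struct model_fs *fs)
{
	fs->generation++;
	fs->sub_generation = fs->generation;
	return fs->generation;
}

/* Start a new sub transaction inside the running transaction. */
static uint64_t model_start_sub_trans(struct model_fs *fs)
{
	return ++fs->sub_generation;
}

/* Commit: the generation catches up with the largest sub transid. */
static void model_commit_transaction(struct model_fs *fs)
{
	fs->generation = fs->sub_generation;
}

/* Return B->transid - A->transid when A contains nr_sub sub transactions. */
static uint64_t model_demo_gap(unsigned int nr_sub)
{
	struct model_fs fs = { .generation = 10, .sub_generation = 10 };
	uint64_t a, b;
	unsigned int i;

	a = model_start_transaction(&fs);     /* transaction A */
	for (i = 0; i < nr_sub; i++)
		model_start_sub_trans(&fs);   /* a(1) ... a(n) */
	model_commit_transaction(&fs);
	b = model_start_transaction(&fs);     /* transaction B */

	assert(b - a >= 1);                   /* point b) above */
	return b - a;
}
```

With no sub transactions the gap between A and B is 1, as before this patch;
with n sub transactions it is n + 1, matching "transid = x+n+1" in the diagram.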

We also switch some uses of 'trans->transid' to 'trans->transaction->transid' to
ensure btrfs works well and to get rid of WARNings.

These are used for the new log code.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/ctree.c       |   35 ++++++++++++++++++-----------------
 fs/btrfs/ctree.h       |    1 +
 fs/btrfs/disk-io.c     |    7 ++++---
 fs/btrfs/extent-tree.c |   10 ++++++----
 fs/btrfs/inode.c       |    4 ++--
 fs/btrfs/ioctl.c       |    2 +-
 fs/btrfs/relocation.c  |    6 +++---
 fs/btrfs/transaction.c |   13 ++++++++-----
 fs/btrfs/transaction.h |    1 +
 fs/btrfs/tree-defrag.c |    2 +-
 fs/btrfs/tree-log.c    |   16 ++++++++++++++--
 11 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 011cab3..41d1d17 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -228,9 +228,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	int level;
 	struct btrfs_disk_key disk_key;
 
-	WARN_ON(root->ref_cows && trans->transid !=
+	WARN_ON(root->ref_cows && trans->transaction->transid !=
 		root->fs_info->running_transaction->transid);
-	WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+	WARN_ON(root->ref_cows && trans->transid < root->last_trans);
 
 	level = btrfs_header_level(buf);
 	if (level == 0)
@@ -425,9 +425,9 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 
 	btrfs_assert_tree_locked(buf);
 
-	WARN_ON(root->ref_cows && trans->transid !=
+	WARN_ON(root->ref_cows && trans->transaction->transid !=
 		root->fs_info->running_transaction->transid);
-	WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+	WARN_ON(root->ref_cows && trans->transid < root->last_trans);
 
 	level = btrfs_header_level(buf);
 
@@ -493,7 +493,8 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 		else
 			parent_start = 0;
 
-		WARN_ON(trans->transid != btrfs_header_generation(parent));
+		WARN_ON(btrfs_header_generation(parent) <
+						trans->transaction->transid);
 		btrfs_set_node_blockptr(parent, parent_slot,
 					cow->start);
 		btrfs_set_node_ptr_generation(parent, parent_slot,
@@ -514,7 +515,7 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   struct extent_buffer *buf)
 {
-	if (btrfs_header_generation(buf) == trans->transid &&
+	if (btrfs_header_generation(buf) >= trans->transaction->transid &&
 	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
 	    !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
 	      btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
@@ -542,7 +543,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		       root->fs_info->running_transaction->transid);
 		WARN_ON(1);
 	}
-	if (trans->transid != root->fs_info->generation) {
+	if (trans->transaction->transid != root->fs_info->generation) {
 		printk(KERN_CRIT "trans %llu running %llu\n",
 		       (unsigned long long)trans->transid,
 		       (unsigned long long)root->fs_info->generation);
@@ -645,7 +646,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 
 	if (trans->transaction != root->fs_info->running_transaction)
 		WARN_ON(1);
-	if (trans->transid != root->fs_info->generation)
+	if (trans->transaction->transid != root->fs_info->generation)
 		WARN_ON(1);
 
 	parent_nritems = btrfs_header_nritems(parent);
@@ -898,7 +899,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 	WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
 		path->locks[level] != BTRFS_WRITE_LOCK_BLOCKING);
-	WARN_ON(btrfs_header_generation(mid) != trans->transid);
+	WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid);
 
 	orig_ptr = btrfs_node_blockptr(mid, orig_slot);
 
@@ -1105,7 +1106,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		return 1;
 
 	mid = path->nodes[level];
-	WARN_ON(btrfs_header_generation(mid) != trans->transid);
+	WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid);
 
 	if (level < BTRFS_MAX_LEVEL - 1)
 		parent = path->nodes[level + 1];
@@ -1942,8 +1943,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 	src_nritems = btrfs_header_nritems(src);
 	dst_nritems = btrfs_header_nritems(dst);
 	push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
-	WARN_ON(btrfs_header_generation(src) != trans->transid);
-	WARN_ON(btrfs_header_generation(dst) != trans->transid);
+	WARN_ON(btrfs_header_generation(src) < trans->transaction->transid);
+	WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid);
 
 	if (!empty && src_nritems <= 8)
 		return 1;
@@ -2005,8 +2006,8 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
 	int dst_nritems;
 	int ret = 0;
 
-	WARN_ON(btrfs_header_generation(src) != trans->transid);
-	WARN_ON(btrfs_header_generation(dst) != trans->transid);
+	WARN_ON(btrfs_header_generation(src) < trans->transaction->transid);
+	WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid);
 
 	src_nritems = btrfs_header_nritems(src);
 	dst_nritems = btrfs_header_nritems(dst);
@@ -2097,7 +2098,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 	btrfs_set_node_key(c, &lower_key, 0);
 	btrfs_set_node_blockptr(c, 0, lower->start);
 	lower_gen = btrfs_header_generation(lower);
-	WARN_ON(lower_gen != trans->transid);
+	WARN_ON(lower_gen < trans->transaction->transid);
 
 	btrfs_set_node_ptr_generation(c, 0, lower_gen);
 
@@ -2177,7 +2178,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 	u32 c_nritems;
 
 	c = path->nodes[level];
-	WARN_ON(btrfs_header_generation(c) != trans->transid);
+	WARN_ON(btrfs_header_generation(c) < trans->transaction->transid);
 	if (c == root->node) {
 		/* trying to split the root, lets make a new one */
 		ret = insert_new_root(trans, root, path, level + 1);
@@ -3751,7 +3752,7 @@ static noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans,
 {
 	int ret;
 
-	WARN_ON(btrfs_header_generation(leaf) != trans->transid);
+	WARN_ON(btrfs_header_generation(leaf) < trans->transaction->transid);
 	ret = del_ptr(trans, root, path, 1, path->slots[1]);
 	if (ret)
 		return ret;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a6263bd..310f586 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -925,6 +925,7 @@ struct btrfs_fs_info {
 	struct mutex durable_block_rsv_mutex;
 
 	u64 generation;
+	u64 sub_generation;
 	u64 last_trans_committed;
 
 	/*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 94ecac3..50e74b1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1035,7 +1035,7 @@ int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		     struct extent_buffer *buf)
 {
 	struct inode *btree_inode = root->fs_info->btree_inode;
-	if (btrfs_header_generation(buf) ==
+	if (btrfs_header_generation(buf) >=
 	    root->fs_info->running_transaction->transid) {
 		btrfs_assert_tree_locked(buf);
 
@@ -1559,7 +1559,7 @@ static int transaction_kthread(void *arg)
 
 		trans = btrfs_join_transaction(root);
 		BUG_ON(IS_ERR(trans));
-		if (transid == trans->transid) {
+		if (transid == trans->transaction->transid) {
 			ret = btrfs_commit_transaction(trans, root);
 			BUG_ON(ret);
 		} else {
@@ -2001,6 +2001,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	csum_root->track_dirty = 1;
 
 	fs_info->generation = generation;
+	fs_info->sub_generation = generation;
 	fs_info->last_trans_committed = generation;
 	fs_info->data_alloc_profile = (u64)-1;
 	fs_info->metadata_alloc_profile = (u64)-1;
@@ -2671,7 +2672,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	int was_dirty;
 
 	btrfs_assert_tree_locked(buf);
-	if (transid != root->fs_info->generation) {
+	if (transid < root->fs_info->generation) {
 		printk(KERN_CRIT "btrfs transid mismatch buffer %llu, "
 		       "found %llu running %llu\n",
 			(unsigned long long)buf->start,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 66bac22..fa101ab 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4363,7 +4363,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 	list_for_each_entry_safe(block_rsv, next_rsv,
 				 &fs_info->durable_block_rsv_list, list) {
 
-		idx = trans->transid & 0x1;
+		idx = trans->transaction->transid & 0x1;
 		if (block_rsv->freed[idx] > 0) {
 			block_rsv_add_bytes(block_rsv,
 					    block_rsv->freed[idx], 0);
@@ -4680,7 +4680,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	if (block_rsv->space_info != cache->space_info)
 		goto out;
 
-	if (btrfs_header_generation(buf) == trans->transid) {
+	if (btrfs_header_generation(buf) >= trans->transaction->transid) {
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, root, buf->start);
 			if (!ret)
@@ -4730,7 +4730,8 @@ pin:
 
 		if (ret) {
 			spin_lock(&block_rsv->lock);
-			block_rsv->freed[trans->transid & 0x1] += buf->len;
+			block_rsv->freed[trans->transaction->transid & 0x1] +=
+								       buf->len;
 			spin_unlock(&block_rsv->lock);
 		}
 	}
@@ -6164,7 +6165,8 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
 		}
 		/* make block locked assertion in clean_tree_block happy */
 		if (!path->locks[level] &&
-		    btrfs_header_generation(eb) == trans->transid) {
+		    btrfs_header_generation(eb) >=
+						 trans->transaction->transid) {
 			btrfs_tree_lock(eb);
 			btrfs_set_lock_blocking(eb);
 			path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 34195f9..99eb0b3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2106,7 +2106,7 @@ void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans,
 	 * space than it frees. So we should make sure there is enough
 	 * reserved space.
 	 */
-	index = trans->transid & 0x1;
+	index = trans->transaction->transid & 0x1;
 	if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) {
 		num_bytes += block_rsv->size -
 			     (block_rsv->reserved + block_rsv->freed[index]);
@@ -2130,7 +2130,7 @@ void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans,
 
 	/* refill source subvolume's orphan block reservation */
 	block_rsv = root->orphan_block_rsv;
-	index = trans->transid & 0x1;
+	index = trans->transaction->transid & 0x1;
 	if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) {
 		num_bytes = block_rsv->size -
 			    (block_rsv->reserved + block_rsv->freed[index]);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2bb0886..bc9a2ad 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2758,7 +2758,7 @@ static noinline long btrfs_ioctl_start_sync(struct file *file, void __user *argp
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
-	transid = trans->transid;
+	transid = trans->transaction->transid;
 	ret = btrfs_commit_transaction_async(trans, root, 0);
 	if (ret) {
 		btrfs_end_transaction(trans, root);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 59bb176..3063be1 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -469,7 +469,7 @@ static int update_backref_cache(struct btrfs_trans_handle *trans,
 		return 0;
 	}
 
-	if (cache->last_trans == trans->transid)
+	if (cache->last_trans >= trans->transaction->transid)
 		return 0;
 
 	/*
@@ -1281,7 +1281,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 		BUG_ON(ret);
 
 		btrfs_set_root_last_snapshot(&root->root_item,
-					     trans->transid - 1);
+					     trans->transaction->transid - 1);
 	} else {
 		/*
 		 * called by btrfs_reloc_post_snapshot_hook.
@@ -2271,7 +2271,7 @@ static int record_reloc_root_in_trans(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_root *root;
 
-	if (reloc_root->last_trans == trans->transid)
+	if (reloc_root->last_trans >= trans->transaction->transid)
 		return 0;
 
 	root = read_fs_root(reloc_root->fs_info, reloc_root->root_key.offset);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 7dc36fa..531b0dc 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -113,7 +113,9 @@ static noinline int join_transaction(struct btrfs_root *root, int nofail)
 	extent_io_tree_init(&cur_trans->dirty_pages,
 			     root->fs_info->btree_inode->i_mapping);
 	root->fs_info->generation++;
+	root->fs_info->sub_generation = root->fs_info->generation;
 	cur_trans->transid = root->fs_info->generation;
+	cur_trans->sub_transid = cur_trans->transid;
 	root->fs_info->running_transaction = cur_trans;
 	spin_unlock(&root->fs_info->trans_lock);
 
@@ -129,7 +131,7 @@ static noinline int join_transaction(struct btrfs_root *root, int nofail)
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root)
 {
-	if (root->ref_cows && root->last_trans < trans->transid) {
+	if (root->ref_cows && root->last_trans < trans->transaction->transid) {
 		WARN_ON(root == root->fs_info->extent_root);
 		WARN_ON(root->commit_root != root->node);
 
@@ -146,7 +148,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();
 
 		spin_lock(&root->fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid) {
+		if (root->last_trans >= trans->transaction->transid) {
 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -194,7 +196,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 	 * and barriers
 	 */
 	smp_rmb();
-	if (root->last_trans == trans->transid &&
+	if (root->last_trans >= trans->transaction->transid &&
 	    !root->in_trans_setup)
 		return 0;
 
@@ -302,7 +304,7 @@ again:
 
 	cur_trans = root->fs_info->running_transaction;
 
-	h->transid = cur_trans->transid;
+	h->transid = cur_trans->sub_transid;
 	h->transaction = cur_trans;
 	h->blocks_used = 0;
 	h->bytes_reserved = 0;
@@ -1346,6 +1348,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	trans->transaction->blocked = 0;
 	spin_lock(&root->fs_info->trans_lock);
+	root->fs_info->generation = cur_trans->sub_transid;
 	root->fs_info->running_transaction = NULL;
 	root->fs_info->trans_no_join = 0;
 	spin_unlock(&root->fs_info->trans_lock);
@@ -1367,7 +1370,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	cur_trans->commit_done = 1;
 
-	root->fs_info->last_trans_committed = cur_trans->transid;
+	root->fs_info->last_trans_committed = cur_trans->sub_transid;
 
 	wake_up(&cur_trans->commit_wait);
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 02564e6..45876b0 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -23,6 +23,7 @@
 
 struct btrfs_transaction {
 	u64 transid;
+	u64 sub_transid;
 	/*
 	 * total writers in this transaction, it must be zero before the
 	 * transaction can end
diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
index 3b580ee..a2569af 100644
--- a/fs/btrfs/tree-defrag.c
+++ b/fs/btrfs/tree-defrag.c
@@ -139,7 +139,7 @@ done:
 	if (ret != -EAGAIN) {
 		memset(&root->defrag_progress, 0,
 		       sizeof(root->defrag_progress));
-		root->defrag_trans_start = trans->transid;
+		root->defrag_trans_start = trans->transaction->transid;
 	}
 	return ret;
 }
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f320641..5a026d2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -134,9 +134,19 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root)
 {
+	struct btrfs_transaction *cur_trans;
 	int ret;
 	int err = 0;
 
+	/* start a new sub transaction */
+	spin_lock(&root->fs_info->trans_lock);
+
+	cur_trans = root->fs_info->running_transaction;
+	cur_trans->sub_transid++;
+	root->fs_info->sub_generation = cur_trans->sub_transid;
+
+	spin_unlock(&root->fs_info->trans_lock);
+
 	mutex_lock(&root->log_mutex);
 	if (root->log_root) {
 		if (!root->log_start_pid) {
@@ -2007,7 +2017,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	/* bail out if we need to do a full commit */
-	if (root->fs_info->last_trans_log_full_commit == trans->transid) {
+	if (root->fs_info->last_trans_log_full_commit >=
+						trans->transaction->transid) {
 		ret = -EAGAIN;
 		mutex_unlock(&root->log_mutex);
 		goto out;
@@ -2084,7 +2095,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * now that we've moved on to the tree of log tree roots,
 	 * check the full commit flag again
 	 */
-	if (root->fs_info->last_trans_log_full_commit == trans->transid) {
+	if (root->fs_info->last_trans_log_full_commit >=
+						trans->transaction->transid) {
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = -EAGAIN;
-- 
1.6.5.2



* [PATCH 03/12 v5] Btrfs: update block generation if should_cow_block fails
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
  2011-08-06  9:37 ` [PATCH 01/12 v5] Revert "Btrfs: do not flush csum items of unchanged file data during treelog" Liu Bo
  2011-08-06  9:37 ` [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 04/12 v5] Btrfs: modify btrfs_drop_extents API Liu Bo
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

Because we've added sub transactions, if we do not need to cow a block, we still
need to record the new sub transid in it.  Thus we need to acquire the write
lock beforehand.

This is used by the log code to find the most up-to-date file extents.
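
As a rough illustration of the idea, here is a userspace sketch.  It is not the
patch itself: struct model_block and the model_* functions are hypothetical, the
relocation flag check of should_cow_block() is omitted, and locking and the
parent pointer update are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer header. */
struct model_block {
	uint64_t generation;  /* sub transid that last touched the block */
	bool written;         /* already written out in this transaction */
	bool dirty;
};

/*
 * A block can be reused without cow if it belongs to the running
 * transaction and has not been written yet (relocation check omitted).
 */
static bool model_should_cow(const struct model_block *b, uint64_t transid)
{
	return !(b->generation >= transid && !b->written);
}

/*
 * Returns true if the caller must cow the block.  Otherwise the block is
 * reused, and it is stamped with the handle's sub transid so that later
 * walks can tell it changed in this sub transaction.
 */
static bool model_prepare_block(struct model_block *b, uint64_t transid,
				uint64_t sub_transid)
{
	assert(sub_transid >= transid);

	if (model_should_cow(b, transid))
		return true;

	if (b->generation != sub_transid) {
		b->generation = sub_transid;
		b->dirty = true;
	}
	return false;
}

/* Example: a block from sub transid 20 is touched again at sub transid 22. */
static struct model_block model_demo_reuse(void)
{
	struct model_block b = { .generation = 20, .written = false,
				 .dirty = false };

	model_prepare_block(&b, 20, 22);
	return b;
}
```

In the example the block is not cowed, but its generation moves from 20 to 22
and it is marked dirty; a block from an older transaction would be cowed instead.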

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/ctree.c |   52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 41d1d17..548246c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -511,6 +511,33 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+static inline void update_block_generation(struct btrfs_trans_handle *trans,
+					   struct btrfs_root *root,
+					   struct extent_buffer *buf,
+					   struct extent_buffer *parent,
+					   int slot)
+{
+	/*
+	 * If it does not need to cow this block, we still need to
+	 * update the block's generation, for transid may have been
+	 * changed during fsync.
+	*/
+	if (btrfs_header_generation(buf) == trans->transid)
+		return;
+
+	if (buf == root->node) {
+		btrfs_set_header_generation(buf, trans->transid);
+		btrfs_mark_buffer_dirty(buf);
+		add_root_to_dirty_list(root);
+	} else {
+		btrfs_set_node_ptr_generation(parent, slot,
+					      trans->transid);
+		btrfs_set_header_generation(buf, trans->transid);
+		btrfs_mark_buffer_dirty(parent);
+		btrfs_mark_buffer_dirty(buf);
+	}
+}
+
 static inline int should_cow_block(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   struct extent_buffer *buf)
@@ -551,6 +578,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 	}
 
 	if (!should_cow_block(trans, root, buf)) {
+		update_block_generation(trans, root, buf, parent, parent_slot);
 		*cow_ret = buf;
 		return 0;
 	}
@@ -1699,16 +1727,6 @@ again:
 		 */
 		if (cow) {
 			/*
-			 * if we don't really need to cow this block
-			 * then we don't want to set the path blocking,
-			 * so we test it here
-			 */
-			if (!should_cow_block(trans, root, b))
-				goto cow_done;
-
-			btrfs_set_path_blocking(p);
-
-			/*
 			 * must have write locks on this node and the
 			 * parent
 			 */
@@ -1718,6 +1736,20 @@ again:
 				goto again;
 			}
 
+			/*
+			 * if we don't really need to cow this block
+			 * then we don't want to set the path blocking,
+			 * so we test it here
+			 */
+			if (!should_cow_block(trans, root, b)) {
+				update_block_generation(trans, root, b,
+							p->nodes[level + 1],
+							p->slots[level + 1]);
+				goto cow_done;
+			}
+
+			btrfs_set_path_blocking(p);
+
 			err = btrfs_cow_block(trans, root, b,
 					      p->nodes[level + 1],
 					      p->slots[level + 1], &b);
-- 
1.6.5.2



* [PATCH 04/12 v5] Btrfs: modify btrfs_drop_extents API
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (2 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 03/12 v5] Btrfs: update block generation if should_cow_block fails Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 05/12 v5] Btrfs: introduce first sub trans Liu Bo
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

We want to use btrfs_drop_extents() in the log code, so add a 'log' argument
that makes it operate on the log tree.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/ctree.h    |    3 ++-
 fs/btrfs/file.c     |    9 +++++++--
 fs/btrfs/inode.c    |    6 +++---
 fs/btrfs/ioctl.c    |    4 ++--
 fs/btrfs/tree-log.c |    2 +-
 5 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 310f586..a8336e7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2615,7 +2615,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
 			    int skip_pinned);
 extern const struct file_operations btrfs_file_operations;
 int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
-		       u64 start, u64 end, u64 *hint_byte, int drop_cache);
+		       u64 start, u64 end, u64 *hint_byte, int drop_cache,
+		       int log);
 int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 			      struct inode *inode, u64 start, u64 end);
 int btrfs_release_file(struct inode *inode, struct file *file);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 010aec8..7a84219 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -546,7 +546,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
  * is deleted from the tree.
  */
 int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
-		       u64 start, u64 end, u64 *hint_byte, int drop_cache)
+		       u64 start, u64 end, u64 *hint_byte, int drop_cache,
+		       int log)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_buffer *leaf;
@@ -566,6 +567,10 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
 	int recow;
 	int ret;
 
+	/* drop the existed extents in log tree */
+	if (log)
+		root = root->log_root;
+
 	if (drop_cache)
 		btrfs_drop_extent_cache(inode, start, end - 1, 0);
 
@@ -746,7 +751,7 @@ next_slot:
 						extent_end - key.offset);
 				extent_end = ALIGN(extent_end,
 						   root->sectorsize);
-			} else if (disk_bytenr > 0) {
+			} else if (disk_bytenr > 0 && !log) {
 				ret = btrfs_free_extent(trans, root,
 						disk_bytenr, num_bytes, 0,
 						root->root_key.objectid,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 99eb0b3..b50ca12 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -246,7 +246,7 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans,
 	}
 
 	ret = btrfs_drop_extents(trans, inode, start, aligned_end,
-				 &hint_byte, 1);
+				 &hint_byte, 1, 0);
 	BUG_ON(ret);
 
 	if (isize > actual_end)
@@ -1657,7 +1657,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	 * with the others.
 	 */
 	ret = btrfs_drop_extents(trans, inode, file_pos, file_pos + num_bytes,
-				 &hint, 0);
+				 &hint, 0, 0);
 	BUG_ON(ret);
 
 	ins.objectid = btrfs_ino(inode);
@@ -3509,7 +3509,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
 
 			err = btrfs_drop_extents(trans, inode, cur_offset,
 						 cur_offset + hole_size,
-						 &hint_byte, 1);
+						 &hint_byte, 1, 0);
 			if (err)
 				break;
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index bc9a2ad..3448dbc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2340,7 +2340,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 				ret = btrfs_drop_extents(trans, inode,
 							 new_key.offset,
 							 new_key.offset + datal,
-							 &hint_byte, 1);
+							 &hint_byte, 1, 0);
 				BUG_ON(ret);
 
 				ret = btrfs_insert_empty_item(trans, root, path,
@@ -2395,7 +2395,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 				ret = btrfs_drop_extents(trans, inode,
 							 new_key.offset,
 							 new_key.offset + datal,
-							 &hint_byte, 1);
+							 &hint_byte, 1, 0);
 				BUG_ON(ret);
 
 				ret = btrfs_insert_empty_item(trans, root, path,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5a026d2..a75990a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -562,7 +562,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 	saved_nbytes = inode_get_bytes(inode);
 	/* drop any overlapping extents */
 	ret = btrfs_drop_extents(trans, inode, start, extent_end,
-				 &alloc_hint, 1);
+				 &alloc_hint, 1, 0);
 	BUG_ON(ret);
 
 	if (found_type == BTRFS_FILE_EXTENT_REG ||
-- 
1.6.5.2



* [PATCH 05/12 v5] Btrfs: introduce first sub trans
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (3 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 04/12 v5] Btrfs: modify btrfs_drop_extents API Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 06/12 v5] Btrfs: still update inode trans stuff when size remains unchanged Liu Bo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

In multi-threaded situations, writeback of a file may span several
sub transactions, so we introduce first_sub_trans to record the sub_transid of
the first sub transaction that modified the inode.  The log code can then skip
file extents which have already been logged or committed to disk.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
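A note on the update rule, as a minimal userspace sketch rather than the
kernel code: struct inode_model and model_set_last_trans() are hypothetical
stand-ins that only mirror the condition used in btrfs_set_inode_last_trans()
below.  first_sub_trans is reset when it is ahead of the current sub transid,
or when everything recorded so far has already been logged or committed.
Locking is left out on purpose.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three btrfs_inode fields involved. */
struct inode_model {
	uint64_t first_sub_trans;
	uint64_t last_trans;
	uint64_t logged_trans;
};

/*
 * Mirrors the condition in the patch: restart first_sub_trans when it is
 * stale, otherwise keep the oldest unlogged sub transid.  last_trans is
 * always advanced to the current sub transid.
 */
static void model_set_last_trans(struct inode_model *in, uint64_t sub_transid,
				 uint64_t last_trans_committed)
{
	if (in->first_sub_trans > sub_transid ||
	    in->last_trans <= in->logged_trans ||
	    in->last_trans <= last_trans_committed)
		in->first_sub_trans = sub_transid;

	in->last_trans = sub_transid;
}
```

With two writes in sub transactions 11 and 12 and no log sync in between,
first_sub_trans stays at 11; once logged_trans catches up with last_trans,
the next write restarts it.
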
 fs/btrfs/btrfs_inode.h |    9 +++++++++
 fs/btrfs/inode.c       |   13 ++++++++++++-
 fs/btrfs/transaction.h |   17 ++++++++++++++++-
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 502b9e9..ba768b0 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -83,6 +83,15 @@ struct btrfs_inode {
 	/* sequence number for NFS changes */
 	u64 sequence;
 
+	/* used to avoid race of first_sub_trans */
+	spinlock_t sub_trans_lock;
+
+	/*
+	 * sub transid of the trans that first modified this inode before
+	 * a trans commit or a log sync
+	 */
+	u64 first_sub_trans;
+
 	/*
 	 * transid of the trans_handle that last modified this inode
 	 */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b50ca12..eb481ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6473,7 +6473,16 @@ again:
 	set_page_dirty(page);
 	SetPageUptodate(page);
 
-	BTRFS_I(inode)->last_trans = root->fs_info->generation;
+	spin_lock(&BTRFS_I(inode)->sub_trans_lock);
+
+	if (BTRFS_I(inode)->first_sub_trans > root->fs_info->sub_generation ||
+	    BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans ||
+	    BTRFS_I(inode)->last_trans <= root->fs_info->last_trans_committed)
+		BTRFS_I(inode)->first_sub_trans = root->fs_info->sub_generation;
+
+	spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
+
+	BTRFS_I(inode)->last_trans = root->fs_info->sub_generation;
 	BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
@@ -6706,6 +6715,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->space_info = NULL;
 	ei->generation = 0;
 	ei->sequence = 0;
+	ei->first_sub_trans = 0;
 	ei->last_trans = 0;
 	ei->last_sub_trans = 0;
 	ei->logged_trans = 0;
@@ -6733,6 +6743,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	extent_io_tree_init(&ei->io_tree, &inode->i_data);
 	extent_io_tree_init(&ei->io_failure_tree, &inode->i_data);
 	mutex_init(&ei->log_mutex);
+	spin_lock_init(&ei->sub_trans_lock);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
 	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 45876b0..f5ca0fd 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -73,7 +73,22 @@ struct btrfs_pending_snapshot {
 static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
 					      struct inode *inode)
 {
-	BTRFS_I(inode)->last_trans = trans->transaction->transid;
+	spin_lock(&BTRFS_I(inode)->sub_trans_lock);
+
+	/*
+	 * We have joined in a transaction, so btrfs_commit_transaction will
+	 * definitely wait for us and it does not need to add a extra
+	 * trans_mutex lock here.
+	 */
+	if (BTRFS_I(inode)->first_sub_trans > trans->transid ||
+	    BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans ||
+	    BTRFS_I(inode)->last_trans <=
+			 BTRFS_I(inode)->root->fs_info->last_trans_committed)
+		BTRFS_I(inode)->first_sub_trans = trans->transid;
+
+	spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
+
+	BTRFS_I(inode)->last_trans = trans->transid;
 	BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 }
 
-- 
1.6.5.2



* [PATCH 06/12 v5] Btrfs: still update inode trans stuff when size remains unchanged
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (4 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 05/12 v5] Btrfs: introduce first sub trans Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 07/12 v5] Btrfs: improve log with sub transaction Liu Bo
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

Because of DIO, commit 1ef30be142d2cc60e2687ef267de864cf31be995 makes btrfs
skip btrfs_update_inode when i_disk_size is not updated.  In the buffered
write case, however, we still need to update the btrfs inode's internal trans
fields, so that the log code can find the inode's changes.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/inode.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb481ea..a30b611 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1789,7 +1789,8 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
 	if (!ret) {
 		ret = btrfs_update_inode(trans, root, inode);
 		BUG_ON(ret);
-	}
+	} else
+		btrfs_set_inode_last_trans(trans, inode);
 	ret = 0;
 out:
 	if (nolock) {
-- 
1.6.5.2



* [PATCH 07/12 v5] Btrfs: improve log with sub transaction
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (5 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 06/12 v5] Btrfs: still update inode trans stuff when size remains unchanged Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 08/12 v5] Btrfs: add checksum check for log Liu Bo
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

When logging an inode _A_, current btrfs will
a) clear all items belonging to _A_ in the log,
b) copy all items belonging to _A_ from the fs/file tree to the log tree,
and this wastes a lot of time, especially when logging big files.

So we want to use a smarter approach, i.e. "find and merge".
File extent items are by far the most numerous, so we focus on them.
Thanks to sub transactions, we can now find the file extent items that
changed after the last _transaction commit_ or the last _log commit_, and
then merge them with the existing ones in the log tree.

This greatly helps fsync performance, because the common case is
"make changes to a _part_ of an inode".

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
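A note on the "find" half, as a rough userspace sketch under stated
assumptions: struct extent_model and both helpers are hypothetical, and only
the generation test matches is_extent_uptodate() below.  Items older than
min_trans are skipped, and each run of consecutive up-to-date slots stands for
one copy_items() call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a file extent item in a leaf. */
struct extent_model {
	uint64_t offset;
	uint64_t generation;
};

/* Same comparison as is_extent_uptodate(): old generations are skipped. */
static int model_extent_uptodate(const struct extent_model *e,
				 uint64_t min_trans)
{
	return e->generation >= min_trans;
}

/*
 * Walk the slots in order and count the runs of consecutive up-to-date
 * items.  Returns the number of runs; *copied receives the item total.
 */
static size_t model_count_batches(const struct extent_model *items, size_t nr,
				  uint64_t min_trans, size_t *copied)
{
	size_t batches = 0;
	size_t run = 0;
	size_t i;

	*copied = 0;
	for (i = 0; i < nr; i++) {
		if (model_extent_uptodate(&items[i], min_trans)) {
			run++;
			continue;
		}
		/* A stale item ends the current run, if any. */
		if (run) {
			batches++;
			*copied += run;
			run = 0;
		}
	}
	if (run) {
		batches++;
		*copied += run;
	}
	return batches;
}
```

For a big file where only a few extents changed since the last log commit,
the copied count stays small no matter how many extents the file has, which
is the point of the patch.
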
 fs/btrfs/tree-log.c |  180 ++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 128 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a75990a..5394055 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2587,61 +2587,107 @@ again:
 }
 
 /*
- * a helper function to drop items from the log before we relog an
- * inode.  max_key_type indicates the highest item type to remove.
- * This cannot be run for file data extents because it does not
- * free the extents they point to.
+ * a helper function to drop items from the log before we merge
+ * the uptodate items into the log tree.
  */
-static int drop_objectid_items(struct btrfs_trans_handle *trans,
-				  struct btrfs_root *log,
-				  struct btrfs_path *path,
-				  u64 objectid, int max_key_type)
+static int prepare_for_merge_items(struct btrfs_trans_handle *trans,
+				   struct inode *inode,
+				   struct extent_buffer *eb,
+				   int slot, int nr)
 {
-	int ret;
-	struct btrfs_key key;
+	struct btrfs_root *log = BTRFS_I(inode)->root->log_root;
+	struct btrfs_path *path;
 	struct btrfs_key found_key;
+	struct btrfs_key key;
+	int i;
+	int ret;
 
-	key.objectid = objectid;
-	key.type = max_key_type;
-	key.offset = (u64)-1;
+	/* There are no relative items of the inode in log. */
+	if (BTRFS_I(inode)->logged_trans < trans->transaction->transid)
+		return 0;
 
-	while (1) {
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	for (i = slot; i < slot + nr; i++) {
+		btrfs_item_key_to_cpu(eb, &key, i);
+
+		if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY) {
+			struct btrfs_file_extent_item *fi;
+			int found_type;
+			u64 mask = BTRFS_I(inode)->root->sectorsize - 1;
+			u64 start = key.offset;
+			u64 extent_end;
+			u64 hint;
+			unsigned long size;
+
+			fi = btrfs_item_ptr(eb, i,
+					    struct btrfs_file_extent_item);
+			found_type = btrfs_file_extent_type(eb, fi);
+
+			if (found_type == BTRFS_FILE_EXTENT_REG ||
+			    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+				extent_end = start +
+					    btrfs_file_extent_num_bytes(eb, fi);
+			} else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+				size = btrfs_file_extent_inline_len(eb, fi);
+				extent_end = (start + size + mask) & ~mask;
+			} else {
+				BUG_ON(1);
+			}
+
+			/* drop any overlapping extents */
+			ret = btrfs_drop_extents(trans, inode, start,
+						 extent_end, &hint, 0, 1);
+			BUG_ON(ret);
+
+			continue;
+		}
+
+		/* non file extent */
 		ret = btrfs_search_slot(trans, log, &key, path, -1, 1);
-		BUG_ON(ret == 0);
 		if (ret < 0)
 			break;
 
-		if (path->slots[0] == 0)
+		/* empty log! */
+		if (ret > 0 && path->slots[0] == 0)
 			break;
 
-		path->slots[0]--;
+		if (ret > 0) {
+			btrfs_release_path(path);
+			continue;
+		}
+
 		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
 				      path->slots[0]);
 
-		if (found_key.objectid != objectid)
-			break;
+		if (btrfs_comp_cpu_keys(&found_key, &key))
+			BUG_ON(1);
 
 		ret = btrfs_del_item(trans, log, path);
-		if (ret)
-			break;
+		BUG_ON(ret);
 		btrfs_release_path(path);
 	}
 	btrfs_release_path(path);
-	return ret;
+	btrfs_free_path(path);
+
+	return 0;
 }
 
 static noinline int copy_items(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *log,
+			       struct inode *inode,
 			       struct btrfs_path *dst_path,
 			       struct extent_buffer *src,
 			       int start_slot, int nr, int inode_only)
 {
 	unsigned long src_offset;
 	unsigned long dst_offset;
+	struct btrfs_root *log = BTRFS_I(inode)->root->log_root;
 	struct btrfs_file_extent_item *extent;
 	struct btrfs_inode_item *inode_item;
-	int ret;
 	struct btrfs_key *ins_keys;
+	int ret;
 	u32 *ins_sizes;
 	char *ins_data;
 	int i;
@@ -2649,6 +2695,10 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 
 	INIT_LIST_HEAD(&ordered_sums);
 
+	ret = prepare_for_merge_items(trans, inode, src, start_slot, nr);
+	if (ret)
+		return ret;
+
 	ins_data = kmalloc(nr * sizeof(struct btrfs_key) +
 			   nr * sizeof(u32), GFP_NOFS);
 	if (!ins_data)
@@ -2752,6 +2802,34 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * a helper function to filter the old file extent items by checking their
+ * generation.
+ */
+static inline int is_extent_uptodate(struct btrfs_path *path, u64 min_trans)
+{
+	struct btrfs_file_extent_item *fi;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+	int slot;
+	u64 gen;
+
+	eb = path->nodes[0];
+	slot = path->slots[0];
+
+	btrfs_item_key_to_cpu(eb, &key, slot);
+
+	if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+		return 1;
+
+	fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+	gen = btrfs_file_extent_generation(eb, fi);
+	if (gen < min_trans)
+		return 0;
+
+	return 1;
+}
+
 /* log a single inode in the tree log.
  * At least one parent directory for this inode must exist in the tree
  * or be logged already.
@@ -2782,6 +2860,16 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	int ins_start_slot = 0;
 	int ins_nr;
 	u64 ino = btrfs_ino(inode);
+	u64 transid;
+
+	/*
+	* We use transid in btrfs_search_forward() as a filter, in order to
+	* find the uptodate block (node or leaf).
+	*/
+	if (BTRFS_I(inode)->first_sub_trans > trans->transaction->transid)
+		transid = BTRFS_I(inode)->first_sub_trans;
+	else
+		transid = trans->transaction->transid;
 
 	log = root->log_root;
 
@@ -2819,29 +2907,12 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 
 	mutex_lock(&BTRFS_I(inode)->log_mutex);
 
-	/*
-	 * a brute force approach to making sure we get the most uptodate
-	 * copies of everything.
-	 */
-	if (S_ISDIR(inode->i_mode)) {
-		int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
-
-		if (inode_only == LOG_INODE_EXISTS)
-			max_key_type = BTRFS_XATTR_ITEM_KEY;
-		ret = drop_objectid_items(trans, log, path, ino, max_key_type);
-	} else {
-		ret = btrfs_truncate_inode_items(trans, log, inode, 0, 0);
-	}
-	if (ret) {
-		err = ret;
-		goto out_unlock;
-	}
 	path->keep_locks = 1;
 
 	while (1) {
 		ins_nr = 0;
 		ret = btrfs_search_forward(root, &min_key, &max_key,
-					   path, 0, trans->transid);
+					   path, 0, transid);
 		if (ret != 0)
 			break;
 again:
@@ -2852,6 +2923,9 @@ again:
 			break;
 
 		src = path->nodes[0];
+		if (!is_extent_uptodate(path, transid))
+			goto filter;
+
 		if (ins_nr && ins_start_slot + ins_nr == path->slots[0]) {
 			ins_nr++;
 			goto next_slot;
@@ -2860,15 +2934,17 @@ again:
 			ins_nr = 1;
 			goto next_slot;
 		}
-
-		ret = copy_items(trans, log, dst_path, src, ins_start_slot,
-				 ins_nr, inode_only);
-		if (ret) {
-			err = ret;
-			goto out_unlock;
+filter:
+		if (ins_nr) {
+			ret = copy_items(trans, inode, dst_path, src,
+					 ins_start_slot,
+					 ins_nr, inode_only);
+			if (ret) {
+				err = ret;
+				goto out_unlock;
+			}
+			ins_nr = 0;
 		}
-		ins_nr = 1;
-		ins_start_slot = path->slots[0];
 next_slot:
 
 		nritems = btrfs_header_nritems(path->nodes[0]);
@@ -2879,7 +2955,7 @@ next_slot:
 			goto again;
 		}
 		if (ins_nr) {
-			ret = copy_items(trans, log, dst_path, src,
+			ret = copy_items(trans, inode, dst_path, src,
 					 ins_start_slot,
 					 ins_nr, inode_only);
 			if (ret) {
@@ -2900,7 +2976,7 @@ next_slot:
 			break;
 	}
 	if (ins_nr) {
-		ret = copy_items(trans, log, dst_path, src,
+		ret = copy_items(trans, inode, dst_path, src,
 				 ins_start_slot,
 				 ins_nr, inode_only);
 		if (ret) {
-- 
1.6.5.2



* [PATCH 08/12 v5] Btrfs: add checksum check for log
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (6 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 07/12 v5] Btrfs: improve log with sub transaction Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 09/12 v5] Btrfs: fix a bug of log check Liu Bo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

If an inode has BTRFS_INODE_NODATASUM set, there is no need to look for its
csum items.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/tree-log.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5394055..15b6f71 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2687,11 +2687,12 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *extent;
 	struct btrfs_inode_item *inode_item;
 	struct btrfs_key *ins_keys;
-	int ret;
+	struct list_head ordered_sums;
 	u32 *ins_sizes;
 	char *ins_data;
+	int ret;
 	int i;
-	struct list_head ordered_sums;
+	int csum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ? 0 : 1;
 
 	INIT_LIST_HEAD(&ordered_sums);
 
@@ -2746,7 +2747,8 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 		 * or deletes of this inode don't have to relog the inode
 		 * again
 		 */
-		if (btrfs_key_type(ins_keys + i) == BTRFS_EXTENT_DATA_KEY) {
+		if (btrfs_key_type(ins_keys + i) ==
+						BTRFS_EXTENT_DATA_KEY && csum) {
 			int found_type;
 			extent = btrfs_item_ptr(src, start_slot + i,
 						struct btrfs_file_extent_item);
-- 
1.6.5.2



* [PATCH 09/12 v5] Btrfs: fix a bug of log check
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (7 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 08/12 v5] Btrfs: add checksum check for log Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 10/12 v5] Btrfs: kick off useless code Liu Bo
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

The current code uses the root's last_log_commit to check whether an inode
has been logged, but root->last_log_commit is shared among all files in the
root.  Say we have N inodes to be logged: after the first inode is logged,
root->last_log_commit is updated and the remaining N-1 inodes will not be
logged.

As we've introduced sub transactions and now fill the inode's last_trans and
logged_trans with the sub_transid instead of the transaction id, we can just
compare last_trans with logged_trans to determine whether the inode being
processed is already logged.  More importantly, these two values are
per-inode, so one inode's state does not interfere with another's.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
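A note on the new check, as a small userspace sketch: struct
log_state_model and model_inode_in_log() are hypothetical names, and the
comparison is the one used in inode_in_log() below.  Because both fields
belong to the inode, logging one inode does not change the answer for
another.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two per-inode fields that are compared. */
struct log_state_model {
	uint64_t last_trans;	/* sub transid of the last modification */
	uint64_t logged_trans;	/* sub transid of the last log of this inode */
};

/*
 * Same comparison as the patched inode_in_log(): the inode was logged in
 * the running transaction and has not been modified since.
 */
static int model_inode_in_log(const struct log_state_model *in,
			      uint64_t transid)
{
	return in->logged_trans >= transid &&
	       in->last_trans <= in->logged_trans;
}
```
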
 fs/btrfs/btrfs_inode.h |    5 -----
 fs/btrfs/ctree.h       |    1 -
 fs/btrfs/disk-io.c     |    2 --
 fs/btrfs/inode.c       |    2 --
 fs/btrfs/transaction.h |    1 -
 fs/btrfs/tree-log.c    |   16 +++-------------
 6 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index ba768b0..918a51b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -98,11 +98,6 @@ struct btrfs_inode {
 	u64 last_trans;
 
 	/*
-	 * log transid when this inode was last modified
-	 */
-	u64 last_sub_trans;
-
-	/*
 	 * transid that last logged this inode
 	 */
 	u64 logged_trans;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a8336e7..896a443 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1161,7 +1161,6 @@ struct btrfs_root {
 	atomic_t log_writers;
 	atomic_t log_commit[2];
 	unsigned long log_transid;
-	unsigned long last_log_commit;
 	unsigned long log_batch;
 	pid_t log_start_pid;
 	bool log_multiple_pids;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 50e74b1..7572b2e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1099,7 +1099,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_writers, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
-	root->last_log_commit = 0;
 	extent_io_tree_init(&root->dirty_log_pages,
 			     fs_info->btree_inode->i_mapping);
 
@@ -1236,7 +1235,6 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	WARN_ON(root->log_root);
 	root->log_root = log_root;
 	root->log_transid = 0;
-	root->last_log_commit = 0;
 	return 0;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a30b611..a1f9155 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6484,7 +6484,6 @@ again:
 	spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
 
 	BTRFS_I(inode)->last_trans = root->fs_info->sub_generation;
-	BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
 
@@ -6718,7 +6717,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->sequence = 0;
 	ei->first_sub_trans = 0;
 	ei->last_trans = 0;
-	ei->last_sub_trans = 0;
 	ei->logged_trans = 0;
 	ei->delalloc_bytes = 0;
 	ei->reserved_bytes = 0;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index f5ca0fd..fd2474a 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -89,7 +89,6 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
 	spin_unlock(&BTRFS_I(inode)->sub_trans_lock);
 
 	BTRFS_I(inode)->last_trans = trans->transid;
-	BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
 }
 
 int btrfs_end_transaction(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 15b6f71..e02b8d3 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1989,7 +1989,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	int ret;
 	struct btrfs_root *log = root->log_root;
 	struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
-	unsigned long log_transid = 0;
 
 	mutex_lock(&root->log_mutex);
 	index1 = root->log_transid % 2;
@@ -2024,8 +2023,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	log_transid = root->log_transid;
-	if (log_transid % 2 == 0)
+	if (root->log_transid % 2 == 0)
 		mark = EXTENT_DIRTY;
 	else
 		mark = EXTENT_NEW;
@@ -2132,11 +2130,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	btrfs_scrub_continue_super(root);
 	ret = 0;
 
-	mutex_lock(&root->log_mutex);
-	if (root->last_log_commit < log_transid)
-		root->last_log_commit = log_transid;
-	mutex_unlock(&root->log_mutex);
-
 out_wake_log_root:
 	atomic_set(&log_root_tree->log_commit[index2], 0);
 	smp_mb();
@@ -3076,14 +3069,11 @@ out:
 static int inode_in_log(struct btrfs_trans_handle *trans,
 		 struct inode *inode)
 {
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	int ret = 0;
 
-	mutex_lock(&root->log_mutex);
-	if (BTRFS_I(inode)->logged_trans == trans->transid &&
-	    BTRFS_I(inode)->last_sub_trans <= root->last_log_commit)
+	if (BTRFS_I(inode)->logged_trans >= trans->transaction->transid &&
+	    BTRFS_I(inode)->last_trans <= BTRFS_I(inode)->logged_trans)
 		ret = 1;
-	mutex_unlock(&root->log_mutex);
 	return ret;
 }
 
-- 
1.6.5.2



* [PATCH 10/12 v5] Btrfs: kick off useless code
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (8 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 09/12 v5] Btrfs: fix a bug of log check Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-08-06  9:37 ` [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log Liu Bo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

fsync waits for writeback to finish, and writeback records the real transid in
last_trans, so the extra +1 is no longer needed to make sure fsync processes
the file.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/file.c |   13 -------------
 1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7a84219..d6df2bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1404,19 +1404,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 
 	mutex_unlock(&inode->i_mutex);
 
-	/*
-	 * we want to make sure fsync finds this change
-	 * but we haven't joined a transaction running right now.
-	 *
-	 * Later on, someone is sure to update the inode and get the
-	 * real transid recorded.
-	 *
-	 * We set last_trans now to the fs_info generation + 1,
-	 * this will either be one more than the running transaction
-	 * or the generation used for the next transaction if there isn't
-	 * one running right now.
-	 */
-	BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
 	if (num_written > 0 || num_written == -EIOCBQUEUED) {
 		err = generic_write_sync(file, pos, num_written);
 		if (err < 0 && num_written > 0)
-- 
1.6.5.2



* [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (9 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 10/12 v5] Btrfs: kick off useless code Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-10-17  0:30   ` Chris Mason
  2011-08-06  9:37 ` [PATCH 12/12 v5] Btrfs: use the right generation number to read log_root_tree Liu Bo
  2011-09-01 17:38 ` [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Mitch Harder
  12 siblings, 1 reply; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

We maintain the inode's logged_trans to avoid relogging it, but if we iput
the inode and read it back in, logged_trans is reset to zero.

So when an inode is still in the log tree and the transaction has not been
committed yet, we do not iput the inode.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
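A note on the keep-in-cache test, as a tiny userspace sketch:
model_keep_inode_cached() is a hypothetical helper that only reproduces the
first condition added to btrfs_drop_inode() below.  The inode is kept when it
was modified in the running transaction and has been logged since that
modification.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 when the inode should stay in cache,
 * i.e. the first branch of the patched btrfs_drop_inode() would return 0.
 */
static int model_keep_inode_cached(uint64_t last_trans, uint64_t logged_trans,
				   uint64_t fs_generation)
{
	return last_trans >= fs_generation && logged_trans >= last_trans;
}
```
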
 fs/btrfs/inode.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a1f9155..10375fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6821,8 +6821,15 @@ int btrfs_drop_inode(struct inode *inode)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 
-	if (btrfs_root_refs(&root->root_item) == 0 &&
-	    !btrfs_is_free_space_inode(root, inode))
+	/*
+	 * If the inode has been in the log tree and the transaction is not
+	 * committed yet, then we need to keep this inode in cache.
+	 */
+	if (BTRFS_I(inode)->last_trans >= root->fs_info->generation &&
+	    BTRFS_I(inode)->logged_trans >= BTRFS_I(inode)->last_trans)
+		return 0;
+	else if (btrfs_root_refs(&root->root_item) == 0 &&
+		 !btrfs_is_free_space_inode(root, inode))
 		return 1;
 	else
 		return generic_drop_inode(inode);
-- 
1.6.5.2



* [PATCH 12/12 v5] Btrfs: use the right generation number to read log_root_tree
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (10 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log Liu Bo
@ 2011-08-06  9:37 ` Liu Bo
  2011-09-01 17:38 ` [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Mitch Harder
  12 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-08-06  9:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: chris.mason

Currently we use the generation number of the super to read in the log
tree root after a crash.  This doesn't always match the sub trans id and
so it doesn't always match the transid stored in the btree blocks.

We can use log_root_transid to record the log_root_tree's generation, so that
when we recover from a crash we can match the log_root_tree's btree blocks.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
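A note on the compatibility fallback, as a short userspace sketch:
model_log_root_transid() is a hypothetical helper that mirrors the selection
made in open_ctree() below.  A zero log_root_transid in the super is taken to
mean the filesystem was written without this patch, so the old generation + 1
value is used.

```c
#include <stdint.h>

/*
 * Hypothetical helper: choose the generation used to read the log root.
 * A non-zero value recorded in the super wins; otherwise fall back to
 * the pre-patch behaviour.
 */
static uint64_t model_log_root_transid(uint64_t super_log_root_transid,
				       uint64_t generation)
{
	if (super_log_root_transid)
		return super_log_root_transid;
	return generation + 1;
}
```
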
 fs/btrfs/disk-io.c  |   14 ++++++++++++--
 fs/btrfs/tree-log.c |    2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7572b2e..c2b8351 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2039,7 +2039,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	/* do not make disk changes in broken FS */
 	if (btrfs_super_log_root(disk_super) != 0 &&
 	    !(fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)) {
-		u64 bytenr = btrfs_super_log_root(disk_super);
+		u64 bytenr;
+		u64 log_root_transid;
 
 		if (fs_devices->rw_devices == 0) {
 			printk(KERN_WARNING "Btrfs log replay required "
@@ -2060,9 +2061,18 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 		__setup_root(nodesize, leafsize, sectorsize, stripesize,
 			     log_tree_root, fs_info, BTRFS_TREE_LOG_OBJECTID);
 
+		bytenr = btrfs_super_log_root(disk_super);
+
+		/* for old btrfs, we need to do something compatible. */
+		if (btrfs_super_log_root_transid(disk_super))
+			log_root_transid =
+				btrfs_super_log_root_transid(disk_super);
+		else
+			log_root_transid = generation + 1;
+
 		log_tree_root->node = read_tree_block(tree_root, bytenr,
 						      blocksize,
-						      generation + 1);
+						      log_root_transid);
 		ret = btrfs_recover_log_trees(log_tree_root);
 		BUG_ON(ret);
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index e02b8d3..b96bb5c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2111,6 +2111,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 				log_root_tree->node->start);
 	btrfs_set_super_log_root_level(&root->fs_info->super_for_commit,
 				btrfs_header_level(log_root_tree->node));
+	btrfs_set_super_log_root_transid(&root->fs_info->super_for_commit,
+					 trans->transid);
 
 	log_root_tree->log_batch = 0;
 	log_root_tree->log_transid++;
-- 
1.6.5.2



* Re: [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff
  2011-08-06  9:37 ` [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff Liu Bo
@ 2011-08-09 17:25   ` Mitch Harder
  0 siblings, 0 replies; 18+ messages in thread
From: Mitch Harder @ 2011-08-09 17:25 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, chris.mason

On Sat, Aug 6, 2011 at 4:37 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote=
:
> Introduce a new concept "sub transaction",
> the relation between transaction and sub transaction is
>
> transaction A =A0 =A0 =A0 ---> transid =3D x
> =A0 sub trans a(1) =A0 ---> sub_transid =3D x+1
> =A0 sub trans a(2) =A0 ---> sub_transid =3D x+2
> =A0 =A0 ... ...
> =A0 sub trans a(n-1) ---> sub_transid =3D x+n-1
> =A0 sub trans a(n) =A0 ---> sub_transid =3D x+n
> transaction B =A0 =A0 =A0 ---> transid =3D x+n+1
> =A0 =A0 ... ...
>
> And the most important is
> a) a trans handler's transid now gets value from sub transid instead =
of transid.
> b) when a transaction commits, transid may not added by 1, but depend=
 on the
> =A0 biggest sub_transaction of the last neighbour transaction,
> =A0 i.e.
> =A0 =A0 =A0 =A0B->transid =3D a(n)->transid + 1,
> =A0 =A0 =A0 =A0(B->transid - A->transid) >=3D 1
> c) we start a new sub transaction after a fsync.
>
> We also ship some 'trans->transid' to 'trans->transaction->transid' t=
o
> ensure btrfs works well and to get rid of WARNings.
>
> These are used for the new log code.
>
> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
> ---
> =A0fs/btrfs/ctree.c =A0 =A0 =A0 | =A0 35 ++++++++++++++++++----------=
-------
> =A0fs/btrfs/ctree.h =A0 =A0 =A0 | =A0 =A01 +
> =A0fs/btrfs/disk-io.c =A0 =A0 | =A0 =A07 ++++---
> =A0fs/btrfs/extent-tree.c | =A0 10 ++++++----
> =A0fs/btrfs/inode.c =A0 =A0 =A0 | =A0 =A04 ++--
> =A0fs/btrfs/ioctl.c =A0 =A0 =A0 | =A0 =A02 +-
> =A0fs/btrfs/relocation.c =A0| =A0 =A06 +++---
> =A0fs/btrfs/transaction.c | =A0 13 ++++++++-----
> =A0fs/btrfs/transaction.h | =A0 =A01 +
> =A0fs/btrfs/tree-defrag.c | =A0 =A02 +-
> =A0fs/btrfs/tree-log.c =A0 =A0| =A0 16 ++++++++++++++--
> =A011 files changed, 59 insertions(+), 38 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 011cab3..41d1d17 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -228,9 +228,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
>        int level;
>        struct btrfs_disk_key disk_key;
>
> -       WARN_ON(root->ref_cows && trans->transid !=
> +       WARN_ON(root->ref_cows && trans->transaction->transid !=
>                root->fs_info->running_transaction->transid);
> -       WARN_ON(root->ref_cows && trans->transid != root->last_trans);
> +       WARN_ON(root->ref_cows && trans->transid < root->last_trans);
>
>        level = btrfs_header_level(buf);
>        if (level == 0)
> @@ -425,9 +425,9 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>
>        btrfs_assert_tree_locked(buf);
>
> -       WARN_ON(root->ref_cows && trans->transid !=
> +       WARN_ON(root->ref_cows && trans->transaction->transid !=
>                root->fs_info->running_transaction->transid);
> -       WARN_ON(root->ref_cows && trans->transid != root->last_trans);
> +       WARN_ON(root->ref_cows && trans->transid < root->last_trans);
>
>        level = btrfs_header_level(buf);
>
> @@ -493,7 +493,8 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>                else
>                        parent_start = 0;
>
> -               WARN_ON(trans->transid != btrfs_header_generation(parent));
> +               WARN_ON(btrfs_header_generation(parent) <
> +                                               trans->transaction->transid);
>                btrfs_set_node_blockptr(parent, parent_slot,
>                                        cow->start);
>                btrfs_set_node_ptr_generation(parent, parent_slot,
> @@ -514,7 +515,7 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
>                                   struct btrfs_root *root,
>                                   struct extent_buffer *buf)
>  {
> -       if (btrfs_header_generation(buf) == trans->transid &&
> +       if (btrfs_header_generation(buf) >= trans->transaction->transid &&
>            !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>            !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>              btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
> @@ -542,7 +543,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
>                       root->fs_info->running_transaction->transid);
>                WARN_ON(1);
>        }
> -       if (trans->transid != root->fs_info->generation) {
> +       if (trans->transaction->transid != root->fs_info->generation) {
>                printk(KERN_CRIT "trans %llu running %llu\n",
>                       (unsigned long long)trans->transid,
>                       (unsigned long long)root->fs_info->generation);
> @@ -645,7 +646,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
>
>        if (trans->transaction != root->fs_info->running_transaction)
>                WARN_ON(1);
> -       if (trans->transid != root->fs_info->generation)
> +       if (trans->transaction->transid != root->fs_info->generation)
>                WARN_ON(1);
>
>        parent_nritems = btrfs_header_nritems(parent);
> @@ -898,7 +899,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
>
>        WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
>                path->locks[level] != BTRFS_WRITE_LOCK_BLOCKING);
> -       WARN_ON(btrfs_header_generation(mid) != trans->transid);
> +       WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid);
>
>        orig_ptr = btrfs_node_blockptr(mid, orig_slot);
>
> @@ -1105,7 +1106,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
>                return 1;
>
>        mid = path->nodes[level];
> -       WARN_ON(btrfs_header_generation(mid) != trans->transid);
> +       WARN_ON(btrfs_header_generation(mid) < trans->transaction->transid);
>
>        if (level < BTRFS_MAX_LEVEL - 1)
>                parent = path->nodes[level + 1];
> @@ -1942,8 +1943,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
>        src_nritems = btrfs_header_nritems(src);
>        dst_nritems = btrfs_header_nritems(dst);
>        push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
> -       WARN_ON(btrfs_header_generation(src) != trans->transid);
> -       WARN_ON(btrfs_header_generation(dst) != trans->transid);
> +       WARN_ON(btrfs_header_generation(src) < trans->transaction->transid);
> +       WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid);
>
>        if (!empty && src_nritems <= 8)
>                return 1;
> @@ -2005,8 +2006,8 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>        int dst_nritems;
>        int ret = 0;
>
> -       WARN_ON(btrfs_header_generation(src) != trans->transid);
> -       WARN_ON(btrfs_header_generation(dst) != trans->transid);
> +       WARN_ON(btrfs_header_generation(src) < trans->transaction->transid);
> +       WARN_ON(btrfs_header_generation(dst) < trans->transaction->transid);
>
>        src_nritems = btrfs_header_nritems(src);
>        dst_nritems = btrfs_header_nritems(dst);
> @@ -2097,7 +2098,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
>        btrfs_set_node_key(c, &lower_key, 0);
>        btrfs_set_node_blockptr(c, 0, lower->start);
>        lower_gen = btrfs_header_generation(lower);
> -       WARN_ON(lower_gen != trans->transid);
> +       WARN_ON(lower_gen < trans->transaction->transid);
>
>        btrfs_set_node_ptr_generation(c, 0, lower_gen);
>
> @@ -2177,7 +2178,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
>        u32 c_nritems;
>
>        c = path->nodes[level];
> -       WARN_ON(btrfs_header_generation(c) != trans->transid);
> +       WARN_ON(btrfs_header_generation(c) < trans->transaction->transid);
>        if (c == root->node) {
>                /* trying to split the root, lets make a new one */
>                ret = insert_new_root(trans, root, path, level + 1);
> @@ -3751,7 +3752,7 @@ static noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans,
>  {
>        int ret;
>
> -       WARN_ON(btrfs_header_generation(leaf) != trans->transid);
> +       WARN_ON(btrfs_header_generation(leaf) < trans->transaction->transid);
>        ret = del_ptr(trans, root, path, 1, path->slots[1]);
>        if (ret)
>                return ret;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a6263bd..310f586 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -925,6 +925,7 @@ struct btrfs_fs_info {
>        struct mutex durable_block_rsv_mutex;
>
>        u64 generation;
> +       u64 sub_generation;
>        u64 last_trans_committed;
>
>        /*
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 94ecac3..50e74b1 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1035,7 +1035,7 @@ int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>                     struct extent_buffer *buf)
>  {
>        struct inode *btree_inode = root->fs_info->btree_inode;
> -       if (btrfs_header_generation(buf) ==
> +       if (btrfs_header_generation(buf) >=
>            root->fs_info->running_transaction->transid) {
>                btrfs_assert_tree_locked(buf);
>
> @@ -1559,7 +1559,7 @@ static int transaction_kthread(void *arg)
>
>                trans = btrfs_join_transaction(root);
>                BUG_ON(IS_ERR(trans));
> -               if (transid == trans->transid) {
> +               if (transid == trans->transaction->transid) {
>                        ret = btrfs_commit_transaction(trans, root);
>                        BUG_ON(ret);
>                } else {
> @@ -2001,6 +2001,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>        csum_root->track_dirty = 1;
>
>        fs_info->generation = generation;
> +       fs_info->sub_generation = generation;
>        fs_info->last_trans_committed = generation;
>        fs_info->data_alloc_profile = (u64)-1;
>        fs_info->metadata_alloc_profile = (u64)-1;
> @@ -2671,7 +2672,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>        int was_dirty;
>
>        btrfs_assert_tree_locked(buf);
> -       if (transid != root->fs_info->generation) {
> +       if (transid < root->fs_info->generation) {
>                printk(KERN_CRIT "btrfs transid mismatch buffer %llu, "
>                       "found %llu running %llu\n",
>                        (unsigned long long)buf->start,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 66bac22..fa101ab 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4363,7 +4363,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>        list_for_each_entry_safe(block_rsv, next_rsv,
>                                 &fs_info->durable_block_rsv_list, list) {
>
> -               idx = trans->transid & 0x1;
> +               idx = trans->transaction->transid & 0x1;
>                if (block_rsv->freed[idx] > 0) {
>                        block_rsv_add_bytes(block_rsv,
>                                            block_rsv->freed[idx], 0);
> @@ -4680,7 +4680,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>        if (block_rsv->space_info != cache->space_info)
>                goto out;
>
> -       if (btrfs_header_generation(buf) == trans->transid) {
> +       if (btrfs_header_generation(buf) >= trans->transaction->transid) {
>                if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
>                        ret = check_ref_cleanup(trans, root, buf->start);
>                        if (!ret)
> @@ -4730,7 +4730,8 @@ pin:
>
>                if (ret) {
>                        spin_lock(&block_rsv->lock);
> -                       block_rsv->freed[trans->transid & 0x1] += buf->len;
> +                       block_rsv->freed[trans->transaction->transid & 0x1] +=
> +                                                                      buf->len;
>                        spin_unlock(&block_rsv->lock);
>                }
>        }
> @@ -6164,7 +6165,8 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
>                }
>                /* make block locked assertion in clean_tree_block happy */
>                if (!path->locks[level] &&
> -                   btrfs_header_generation(eb) == trans->transid) {
> +                   btrfs_header_generation(eb) >=
> +                                                trans->transaction->transid) {
>                        btrfs_tree_lock(eb);
>                        btrfs_set_lock_blocking(eb);
>                        path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 34195f9..99eb0b3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2106,7 +2106,7 @@ void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans,
>         * space than it frees. So we should make sure there is enough
>         * reserved space.
>         */
> -       index = trans->transid & 0x1;
> +       index = trans->transaction->transid & 0x1;
>        if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) {
>                num_bytes += block_rsv->size -
>                             (block_rsv->reserved + block_rsv->freed[index]);
> @@ -2130,7 +2130,7 @@ void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans,
>
>        /* refill source subvolume's orphan block reservation */
>        block_rsv = root->orphan_block_rsv;
> -       index = trans->transid & 0x1;
> +       index = trans->transaction->transid & 0x1;
>        if (block_rsv->reserved + block_rsv->freed[index] < block_rsv->size) {
>                num_bytes = block_rsv->size -
>                            (block_rsv->reserved + block_rsv->freed[index]);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 2bb0886..bc9a2ad 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2758,7 +2758,7 @@ static noinline long btrfs_ioctl_start_sync(struct file *file, void __user *argp
>        trans = btrfs_start_transaction(root, 0);
>        if (IS_ERR(trans))
>                return PTR_ERR(trans);
> -       transid = trans->transid;
> +       transid = trans->transaction->transid;
>        ret = btrfs_commit_transaction_async(trans, root, 0);
>        if (ret) {
>                btrfs_end_transaction(trans, root);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 59bb176..3063be1 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -469,7 +469,7 @@ static int update_backref_cache(struct btrfs_trans_handle *trans,
>                return 0;
>        }
>
> -       if (cache->last_trans == trans->transid)
> +       if (cache->last_trans >= trans->transaction->transid)
>                return 0;
>
>        /*
> @@ -1281,7 +1281,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
>                BUG_ON(ret);
>
>                btrfs_set_root_last_snapshot(&root->root_item,
> -                                            trans->transid - 1);
> +                                            trans->transaction->transid - 1);
>        } else {
>                /*
>                 * called by btrfs_reloc_post_snapshot_hook.
> @@ -2271,7 +2271,7 @@ static int record_reloc_root_in_trans(struct btrfs_trans_handle *trans,
>  {
>        struct btrfs_root *root;
>
> -       if (reloc_root->last_trans == trans->transid)
> +       if (reloc_root->last_trans >= trans->transaction->transid)
>                return 0;
>
>        root = read_fs_root(reloc_root->fs_info, reloc_root->root_key.offset);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 7dc36fa..531b0dc 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -113,7 +113,9 @@ static noinline int join_transaction(struct btrfs_root *root, int nofail)
>        extent_io_tree_init(&cur_trans->dirty_pages,
>                             root->fs_info->btree_inode->i_mapping);
>        root->fs_info->generation++;
> +       root->fs_info->sub_generation = root->fs_info->generation;
>        cur_trans->transid = root->fs_info->generation;
> +       cur_trans->sub_transid = cur_trans->transid;
>        root->fs_info->running_transaction = cur_trans;
>        spin_unlock(&root->fs_info->trans_lock);
>
> @@ -129,7 +131,7 @@ static noinline int join_transaction(struct btrfs_root *root, int nofail)
>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
>                               struct btrfs_root *root)
>  {
> -       if (root->ref_cows && root->last_trans < trans->transid) {
> +       if (root->ref_cows && root->last_trans < trans->transaction->transid) {
>                WARN_ON(root == root->fs_info->extent_root);
>                WARN_ON(root->commit_root != root->node);
>
> @@ -146,7 +148,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>                smp_wmb();
>
>                spin_lock(&root->fs_info->fs_roots_radix_lock);
> -               if (root->last_trans == trans->transid) {
> +               if (root->last_trans >= trans->transaction->transid) {
>                        spin_unlock(&root->fs_info->fs_roots_radix_lock);
>                        return 0;
>                }
> @@ -194,7 +196,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>         * and barriers
>         */
>        smp_rmb();
> -       if (root->last_trans == trans->transid &&
> +       if (root->last_trans >= trans->transaction->transid &&
>            !root->in_trans_setup)
>                return 0;
>
> @@ -302,7 +304,7 @@ again:
>
>        cur_trans = root->fs_info->running_transaction;
>
> -       h->transid = cur_trans->transid;
> +       h->transid = cur_trans->sub_transid;
>        h->transaction = cur_trans;
>        h->blocks_used = 0;
>        h->bytes_reserved = 0;
> @@ -1346,6 +1348,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>
>        trans->transaction->blocked = 0;
>        spin_lock(&root->fs_info->trans_lock);
> +       root->fs_info->generation = cur_trans->sub_transid;
>        root->fs_info->running_transaction = NULL;
>        root->fs_info->trans_no_join = 0;
>        spin_unlock(&root->fs_info->trans_lock);
> @@ -1367,7 +1370,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>
>        cur_trans->commit_done = 1;
>
> -       root->fs_info->last_trans_committed = cur_trans->transid;
> +       root->fs_info->last_trans_committed = cur_trans->sub_transid;
>
>        wake_up(&cur_trans->commit_wait);
>
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 02564e6..45876b0 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -23,6 +23,7 @@
>
>  struct btrfs_transaction {
>        u64 transid;
> +       u64 sub_transid;
>        /*
>         * total writers in this transaction, it must be zero before the
>         * transaction can end
> diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
> index 3b580ee..a2569af 100644
> --- a/fs/btrfs/tree-defrag.c
> +++ b/fs/btrfs/tree-defrag.c
> @@ -139,7 +139,7 @@ done:
>        if (ret != -EAGAIN) {
>                memset(&root->defrag_progress, 0,
>                       sizeof(root->defrag_progress));
> -               root->defrag_trans_start = trans->transid;
> +               root->defrag_trans_start = trans->transaction->transid;
>        }
>        return ret;
>  }
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f320641..5a026d2 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -134,9 +134,19 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>  static int start_log_trans(struct btrfs_trans_handle *trans,
>                           struct btrfs_root *root)
>  {
> +       struct btrfs_transaction *cur_trans;
>        int ret;
>        int err = 0;
>
> +       /* start a new sub transaction */
> +       spin_lock(&root->fs_info->trans_lock);
> +
> +       cur_trans = root->fs_info->running_transaction;
> +       cur_trans->sub_transid++;
> +       root->fs_info->sub_generation = cur_trans->sub_transid;
> +
> +       spin_unlock(&root->fs_info->trans_lock);
> +
>        mutex_lock(&root->log_mutex);
>        if (root->log_root) {
>                if (!root->log_start_pid) {
> @@ -2007,7 +2017,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>        }
>
>        /* bail out if we need to do a full commit */
> -       if (root->fs_info->last_trans_log_full_commit == trans->transid) {
> +       if (root->fs_info->last_trans_log_full_commit >=
> +                                               trans->transaction->transid) {
>                ret = -EAGAIN;
>                mutex_unlock(&root->log_mutex);
>                goto out;
> @@ -2084,7 +2095,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         * now that we've moved on to the tree of log tree roots,
>         * check the full commit flag again
>         */
> -       if (root->fs_info->last_trans_log_full_commit == trans->transid) {
> +       if (root->fs_info->last_trans_log_full_commit >=
> +                                               trans->transaction->transid) {
>                btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>                mutex_unlock(&log_root_tree->log_mutex);
>                ret = -EAGAIN;
> --
> 1.6.5.2
>
>
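
For anyone following the quoted diff, here is a minimal user-space sketch of the id bookkeeping it appears to introduce. Every `sim_*` name is a hypothetical stand-in, not a kernel API. The sketch only models the transaction.c and tree-log.c hunks quoted above and assumes a single-threaded caller (the real code does this under trans_lock).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_transaction. */
struct sim_transaction {
	uint64_t transid;     /* fixed for the life of the transaction */
	uint64_t sub_transid; /* bumped whenever a log transaction starts */
};

/* Hypothetical stand-in for the fs_info fields the patch touches. */
struct sim_fs_info {
	uint64_t generation;
	uint64_t sub_generation;
	struct sim_transaction *running;
};

/* Hypothetical stand-in for struct btrfs_trans_handle. */
struct sim_handle {
	uint64_t transid; /* sub transid seen when the handle was created */
	struct sim_transaction *transaction;
};

/* Models the join_transaction() hunk: both ids start out equal. */
static void sim_new_transaction(struct sim_fs_info *fs,
				struct sim_transaction *t)
{
	fs->generation++;
	fs->sub_generation = fs->generation;
	t->transid = fs->generation;
	t->sub_transid = t->transid;
	fs->running = t;
}

/* Models the start_log_trans() hunk: open a new sub transaction. */
static void sim_start_log_trans(struct sim_fs_info *fs)
{
	assert(fs->running != NULL);
	fs->running->sub_transid++;
	fs->sub_generation = fs->running->sub_transid;
}

/* Models "h->transid = cur_trans->sub_transid". */
static struct sim_handle sim_join(struct sim_fs_info *fs)
{
	struct sim_handle h = { fs->running->sub_transid, fs->running };
	return h;
}

/* Models the commit hunk: generation catches up with the last sub id. */
static void sim_commit(struct sim_fs_info *fs)
{
	fs->generation = fs->running->sub_transid;
	fs->running = NULL;
}

/* The relaxed check: a block is "ours" if stamped at or after transid. */
static int sim_block_in_trans(uint64_t block_gen, const struct sim_handle *h)
{
	return block_gen >= h->transaction->transid;
}

/* The freed[] slot is picked from the transaction id, not the sub id. */
static unsigned sim_freed_index(const struct sim_handle *h)
{
	return (unsigned)(h->transaction->transid & 0x1);
}
```

With fs.generation starting at 10, sim_new_transaction() yields transid 11; two sim_start_log_trans() calls move sub_transid to 13, so a handle joined afterwards carries 13 while transaction->transid stays 11. A block stamped 13 still passes the >= test against 11, which is presumably why the diff relaxes == to >= and takes `& 0x1` from transaction->transid, so the freed[] slot does not flip between sub transactions.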

Portions of this patch conflict with another patch recently posted to
the Mailing List by Josef.

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg11657.html

"[PATCH] Btrfs: kill the durable block rsv stuff" removes portions of
the functions btrfs_finish_extent_commit() and btrfs_free_tree_block()
in fs/btrfs/extent-tree.c that are modified in this patch.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction
  2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
                   ` (11 preceding siblings ...)
  2011-08-06  9:37 ` [PATCH 12/12 v5] Btrfs: use the right generation number to read log_root_tree Liu Bo
@ 2011-09-01 17:38 ` Mitch Harder
  2011-09-02  0:42   ` Liu Bo
  12 siblings, 1 reply; 18+ messages in thread
From: Mitch Harder @ 2011-09-01 17:38 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, chris.mason

On Sat, Aug 6, 2011 at 4:37 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:
> I've fixed a bug and rebased this to the latest for-linus branch,
> and with applying my previous posted patch:
>
>        [PATCH] Btrfs: fix an oops of log replay
>
> , I also test this sub transaction patchset with
> a) sysbench 0.4.12 tool and
> b) Chris's synctest tool in both _crash_ and _uncrash_ cases, and it works well.
>
> Please test this and feel free to notice me if there are any problems.
> Hope that it can get through with no bugs and be ready for merge this time :)
>
> ===
> I've been working to try to improve the write-ahead log's performance,
> and I found that the bottleneck addresses in the checksum items,
> especially when we want to make a random write on a large file, e.g a 4G file.
>
> Then a idea for this suggested by Chris is to use sub transaction ids and just
> to log the part of inode that had changed since either the last log commit or
> the last transaction commit.  And as we also push the sub transid into the btree
> blocks, we'll get much faster tree walks.  As a result, we abandon the original
> brute force approach, which is "to delete all items of the inode in log",
> to making sure we get the most uptodate copies of everything, and instead
> we manage to "find and merge", i.e. finding extents in the log tree and merging
> in the new extents from the file.
>
> This patchset puts the above idea into code, and although the code is now more
> complex, it brings us a great deal of performance improvement:
>
> in my sysbench "write + fsync" test:
>
>        451.01Kb/sec -> 4.3621Mb/sec
>
> In v2, thanks to Chris, we worked together to solve 2 bugs, and after that it
> works as expected.
> In v3, thanks to Josef, we simplify several code.
> In v4, rebase to the latest for-linus branch, Chris hit two problems, and we
> solve them.
>
> Since there are some vital changes in recent rc, like "kill trans_mutex" and
> "use cur_trans", as David asked, I rebase the patchset to the latest for-linus
> branch.
>
> More tests are welcome!
>
>
> Liu Bo (12):
>  Revert "Btrfs: do not flush csum items of unchanged file data during
>    treelog"
>  Btrfs: introduce sub transaction stuff
>  Btrfs: update block generation if should_cow_block fails
>  Btrfs: modify btrfs_drop_extents API
>  Btrfs: introduce first sub trans
>  Btrfs: still update inode trans stuff when size remains unchanged
>  Btrfs: improve log with sub transaction
>  Btrfs: add checksum check for log
>  Btrfs: fix a bug of log check
>  Btrfs: kick off useless code
>  Btrfs: do not iput inode when inode is still in log
>  Btrfs: use the right generation number to read log_root_tree
>
>  fs/btrfs/btrfs_inode.h |   12 ++-
>  fs/btrfs/ctree.c       |   87 +++++++++++++------
>  fs/btrfs/ctree.h       |    5 +-
>  fs/btrfs/disk-io.c     |   23 ++++--
>  fs/btrfs/extent-tree.c |   10 ++-
>  fs/btrfs/file.c        |   22 ++---
>  fs/btrfs/inode.c       |   39 ++++++---
>  fs/btrfs/ioctl.c       |    6 +-
>  fs/btrfs/relocation.c  |    6 +-
>  fs/btrfs/transaction.c |   13 ++-
>  fs/btrfs/transaction.h |   19 ++++-
>  fs/btrfs/tree-defrag.c |    2 +-
>  fs/btrfs/tree-log.c    |  225 ++++++++++++++++++++++++++++++++----------------
>  13 files changed, 312 insertions(+), 157 deletions(-)
>

I've had the v5 stack of patches in my kernel for about 3 weeks now.

I've just been testing for general stability in a 3.0 series kernel,
and I haven't run across any issues or obvious performance effects.

I've been testing on both x86 and x86_64 installations in Desktop
service without RAID.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction
  2011-09-01 17:38 ` [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Mitch Harder
@ 2011-09-02  0:42   ` Liu Bo
  0 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-09-02  0:42 UTC (permalink / raw)
  To: Mitch Harder; +Cc: linux-btrfs, chris.mason

On 09/02/2011 01:38 AM, Mitch Harder wrote:
> On Sat, Aug 6, 2011 at 4:37 AM, Liu Bo <liubo2009@cn.fujitsu.com> wrote:
>> I've fixed a bug and rebased this to the latest for-linus branch,
>> and with applying my previous posted patch:
>>
>>        [PATCH] Btrfs: fix an oops of log replay
>>
>> , I also test this sub transaction patchset with
>> a) sysbench 0.4.12 tool and
>> b) Chris's synctest tool in both _crash_ and _uncrash_ cases, and it works well.
>>
>> Please test this and feel free to notice me if there are any problems.
>> Hope that it can get through with no bugs and be ready for merge this time :)
>>
>> ===
>> I've been working to try to improve the write-ahead log's performance,
>> and I found that the bottleneck addresses in the checksum items,
>> especially when we want to make a random write on a large file, e.g a 4G file.
>>
>> Then a idea for this suggested by Chris is to use sub transaction ids and just
>> to log the part of inode that had changed since either the last log commit or
>> the last transaction commit.  And as we also push the sub transid into the btree
>> blocks, we'll get much faster tree walks.  As a result, we abandon the original
>> brute force approach, which is "to delete all items of the inode in log",
>> to making sure we get the most uptodate copies of everything, and instead
>> we manage to "find and merge", i.e. finding extents in the log tree and merging
>> in the new extents from the file.
>>
>> This patchset puts the above idea into code, and although the code is now more
>> complex, it brings us a great deal of performance improvement:
>>
>> in my sysbench "write + fsync" test:
>>
>>        451.01Kb/sec -> 4.3621Mb/sec
>>
>> In v2, thanks to Chris, we worked together to solve 2 bugs, and after that it
>> works as expected.
>> In v3, thanks to Josef, we simplify several code.
>> In v4, rebase to the latest for-linus branch, Chris hit two problems, and we
>> solve them.
>>
>> Since there are some vital changes in recent rc, like "kill trans_mutex" and
>> "use cur_trans", as David asked, I rebase the patchset to the latest for-linus
>> branch.
>>
>> More tests are welcome!
>>
>>
>> Liu Bo (12):
>>  Revert "Btrfs: do not flush csum items of unchanged file data during
>>    treelog"
>>  Btrfs: introduce sub transaction stuff
>>  Btrfs: update block generation if should_cow_block fails
>>  Btrfs: modify btrfs_drop_extents API
>>  Btrfs: introduce first sub trans
>>  Btrfs: still update inode trans stuff when size remains unchanged
>>  Btrfs: improve log with sub transaction
>>  Btrfs: add checksum check for log
>>  Btrfs: fix a bug of log check
>>  Btrfs: kick off useless code
>>  Btrfs: do not iput inode when inode is still in log
>>  Btrfs: use the right generation number to read log_root_tree
>>
>>  fs/btrfs/btrfs_inode.h |   12 ++-
>>  fs/btrfs/ctree.c       |   87 +++++++++++++------
>>  fs/btrfs/ctree.h       |    5 +-
>>  fs/btrfs/disk-io.c     |   23 ++++--
>>  fs/btrfs/extent-tree.c |   10 ++-
>>  fs/btrfs/file.c        |   22 ++---
>>  fs/btrfs/inode.c       |   39 ++++++---
>>  fs/btrfs/ioctl.c       |    6 +-
>>  fs/btrfs/relocation.c  |    6 +-
>>  fs/btrfs/transaction.c |   13 ++-
>>  fs/btrfs/transaction.h |   19 ++++-
>>  fs/btrfs/tree-defrag.c |    2 +-
>>  fs/btrfs/tree-log.c    |  225 ++++++++++++++++++++++++++++++++----------------
>>  13 files changed, 312 insertions(+), 157 deletions(-)
>>
> 
> I've had the v5 stack of patches in my kernel for about 3 weeks now.
> 
> I've just been testing for general stability in a 3.0 series kernel,
> and I haven't run across any issues or obvious performance effects.
> 
> I've been testing on both x86 and x86_64 installations in Desktop
> service without RAID.

That's great.
Mitch, thanks a lot for the testing work!

thanks,
liubo




* Re: [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log
  2011-08-06  9:37 ` [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log Liu Bo
@ 2011-10-17  0:30   ` Chris Mason
  2011-10-17  5:22     ` Liu Bo
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2011-10-17  0:30 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

Excerpts from Liu Bo's message of 2011-08-06 05:37:46 -0400:
> We maintain the inode's logged_trans to avoid relogging it, but if we iput
> the inode and reread it, logged_trans is reset to zero.
> 
> So when an inode is still in the log tree and the transaction is not committed
> yet, we do not iput the inode.

I know you've tried a few different methods for this, but I think this
one will end up leading to OOM, since it pins the inodes in ram and we
don't have a way for the inode shrinker to force a commit.

So I took this code out and couldn't trigger my eexist oops anymore.
I'm going through right now to make sure this isn't because someone took
the BUG_ON out ;)

The big problem was the code wasn't expecting to find previously logged
items for the inode because the last trans field wasn't set.  It seems
like we should just be able to deal with these eexist returns without
any real trouble?

-chris


* Re: [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log
  2011-10-17  0:30   ` Chris Mason
@ 2011-10-17  5:22     ` Liu Bo
  0 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2011-10-17  5:22 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 10/17/2011 08:30 AM, Chris Mason wrote:
> Excerpts from Liu Bo's message of 2011-08-06 05:37:46 -0400:
>> We maintain the inode's logged_trans to avoid relogging it, but if we iput
>> the inode and reread it, logged_trans is reset to zero.
>>
>> So when an inode is still in the log tree and the transaction is not committed
>> yet, we do not iput the inode.
> 
> I know you've tried a few different methods for this, but I think this
> one will end up leading to OOM, since it pins the inodes in ram and we
> don't have a way for the inode shrinker to force a commit.
> 

Ok, I see what the problem is.

> So I took this code out and couldn't trigger my eexist oops anymore.
> I'm going through right now to make sure this isn't because someone took
> the BUG_ON out ;)
> 

Right, without the patch, copy_items()'s call to btrfs_insert_empty_items()
should return -EEXIST.

> The big problem was the code wasn't expecting to find previously logged
> items for the inode because the last trans field wasn't set.  It seems

If the last trans field isn't set, it means the inode has not been modified, so
there is no need to look for previously logged items, is there?  Or am I missing
something?

> like we should just be able to deal with these eexist returns without
> any real trouble?
> 

Agreed.  Actually, this patch came from Josef's advice that it would be OK to
hold the inode in cache for a while, but I apologize for not taking the OOM
problem into consideration.

So we can go back to the tree lookup method.  Can you pick this one instead?

From: Liu Bo <liubo2009@cn.fujitsu.com>

[PATCH] Btrfs: deal with EEXIST after iput

There are two cases in which BTRFS_I(inode)->logged_trans is zero:
a) the inode has just been allocated;
b) the inode has been iput and then reread.

However, in case b), if the transaction is not committed yet, the inode _may_
still remain in the log tree.

So we need to check the log tree to set logged_trans to the right value;
otherwise logging the inode may hit an EEXIST.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
---
 fs/btrfs/inode.c    |    9 +++------
 fs/btrfs/tree-log.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8db16fa..e310b5b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1770,12 +1770,9 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
 	add_pending_csums(trans, inode, ordered_extent->file_offset,
 			  &ordered_extent->list);
 
-	ret = btrfs_ordered_update_i_size(inode, 0, ordered_extent);
-	if (!ret) {
-		ret = btrfs_update_inode(trans, root, inode);
-		BUG_ON(ret);
-	} else
-		btrfs_set_inode_last_trans(trans, inode);
+	btrfs_ordered_update_i_size(inode, 0, ordered_extent);
+	ret = btrfs_update_inode(trans, root, inode);
+	BUG_ON(ret);
 	ret = 0;
 out:
 	if (nolock) {
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8bedfb8..fea4f39 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3035,6 +3035,37 @@ out:
 	return ret;
 }
 
+static int check_logged_trans(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root, struct inode *inode)
+{
+	struct btrfs_inode_item *inode_item;
+	struct btrfs_path *path;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_search_slot(trans, root,
+				&BTRFS_I(inode)->location, path, 0, 0);
+	if (ret) {
+		if (ret > 0)
+			ret = 0;
+		goto out;
+	}
+
+	btrfs_unlock_up_safe(path, 1);
+	inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				    struct btrfs_inode_item);
+
+	BTRFS_I(inode)->logged_trans = btrfs_inode_transid(path->nodes[0],
+							   inode_item);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+
 static int inode_in_log(struct btrfs_trans_handle *trans,
 		 struct inode *inode)
 {
@@ -3087,6 +3118,18 @@ int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto end_no_trans;
 
+	/*
+	 * After we iput an inode and reread it from disk, logged_trans is 0.
+	 * However, this inode _may_ still remain in the log tree and not be
+	 * committed yet.
+	 * So we need to check the log tree to set logged_trans to the right value.
+	 */
+	if (!BTRFS_I(inode)->logged_trans && root->log_root) {
+		ret = check_logged_trans(trans, root->log_root, inode);
+		if (ret)
+			goto end_no_trans;
+	}
+
 	if (inode_in_log(trans, inode)) {
 		ret = BTRFS_NO_LOG_SYNC;
 		goto end_no_trans;
-- 
1.6.5.2


> -chris
> 

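The lookup idea in the patch above can be modelled in user space.  What
follows is only a minimal sketch, not btrfs code: struct log_tree,
struct model_inode, log_insert(), restore_logged_trans() and log_inode() are
all hypothetical names invented for illustration, and the "already logged"
test is reduced to a plain comparison of logged_trans with the current
transaction id, which is simpler than what inode_in_log() really does.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these types are the kernel's. */
struct log_item {
	uint64_t ino;      /* inode number, used as the key */
	uint64_t transid;  /* transaction id recorded when the item was logged */
};

struct log_tree {
	struct log_item items[16];
	size_t nr;
};

struct model_inode {
	uint64_t ino;
	uint64_t logged_trans;  /* 0 for a new inode, and 0 again after reread */
};

/* Insert a new item; as with a btree insert, a duplicate key is -EEXIST. */
static int log_insert(struct log_tree *tree, uint64_t ino, uint64_t transid)
{
	size_t i;

	for (i = 0; i < tree->nr; i++)
		if (tree->items[i].ino == ino)
			return -EEXIST;
	if (tree->nr == sizeof(tree->items) / sizeof(tree->items[0]))
		return -ENOSPC;

	tree->items[tree->nr].ino = ino;
	tree->items[tree->nr].transid = transid;
	tree->nr++;
	return 0;
}

/*
 * Plays the role check_logged_trans() has in the patch: if the inode is
 * found in the log, copy the recorded transid back into logged_trans.
 * If it is not found (case a, a freshly allocated inode), leave it at 0.
 */
static void restore_logged_trans(const struct log_tree *tree,
				 struct model_inode *inode)
{
	size_t i;

	for (i = 0; i < tree->nr; i++) {
		if (tree->items[i].ino == inode->ino) {
			inode->logged_trans = tree->items[i].transid;
			return;
		}
	}
}

/*
 * Returns 1 if the inode was already logged in this transaction,
 * 0 if it was logged by this call, or a negative errno on failure.
 * use_lookup selects whether the log is consulted when logged_trans is 0.
 */
static int log_inode(struct log_tree *tree, struct model_inode *inode,
		     uint64_t transid, int use_lookup)
{
	int ret;

	if (use_lookup && inode->logged_trans == 0)
		restore_logged_trans(tree, inode);

	if (inode->logged_trans == transid)
		return 1;

	ret = log_insert(tree, inode->ino, transid);
	if (ret)
		return ret;

	inode->logged_trans = transid;
	return 0;
}
```

In this model, logging an inode, clearing its logged_trans (standing in for
iput plus reread) and logging it again with use_lookup == 0 fails with
-EEXIST, while the same call with use_lookup == 1 restores logged_trans from
the log and returns 1 without attempting a second insert.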


Thread overview: 18+ messages
2011-08-06  9:37 [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Liu Bo
2011-08-06  9:37 ` [PATCH 01/12 v5] Revert "Btrfs: do not flush csum items of unchanged file data during treelog" Liu Bo
2011-08-06  9:37 ` [PATCH 02/12 v5] Btrfs: introduce sub transaction stuff Liu Bo
2011-08-09 17:25   ` Mitch Harder
2011-08-06  9:37 ` [PATCH 03/12 v5] Btrfs: update block generation if should_cow_block fails Liu Bo
2011-08-06  9:37 ` [PATCH 04/12 v5] Btrfs: modify btrfs_drop_extents API Liu Bo
2011-08-06  9:37 ` [PATCH 05/12 v5] Btrfs: introduce first sub trans Liu Bo
2011-08-06  9:37 ` [PATCH 06/12 v5] Btrfs: still update inode trans stuff when size remains unchanged Liu Bo
2011-08-06  9:37 ` [PATCH 07/12 v5] Btrfs: improve log with sub transaction Liu Bo
2011-08-06  9:37 ` [PATCH 08/12 v5] Btrfs: add checksum check for log Liu Bo
2011-08-06  9:37 ` [PATCH 09/12 v5] Btrfs: fix a bug of log check Liu Bo
2011-08-06  9:37 ` [PATCH 10/12 v5] Btrfs: kick off useless code Liu Bo
2011-08-06  9:37 ` [PATCH 11/12 v5] Btrfs: do not iput inode when inode is still in log Liu Bo
2011-10-17  0:30   ` Chris Mason
2011-10-17  5:22     ` Liu Bo
2011-08-06  9:37 ` [PATCH 12/12 v5] Btrfs: use the right generation number to read log_root_tree Liu Bo
2011-09-01 17:38 ` [PATCH 00/12 v5] Btrfs: improve write ahead log with sub transaction Mitch Harder
2011-09-02  0:42   ` Liu Bo
