linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/2]btrfs: add two ioctls to do metadata readahead
@ 2010-07-14  8:02 Shaohua Li
  2010-07-14  8:26 ` Shaohua Li
  2010-07-19  8:03 ` Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Shaohua Li @ 2010-07-14  8:02 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, Arjan van de Ven, Wu, Fengguang

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

Hi,
  We have file readahead to do asyn file read, but has no metadata
readahead. For a list of files, their metadata is stored in fragmented
disk space and metadata read is a sync operation, which impacts the
efficiency of readahead much. The patches try to add meatadata readahead
for btrfs.
  In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
the inode to a fd so we could use existing syscalls (readahead, mincore
or upcoming fincore) to do readahead, but the inode is hidden, there is
no easy way for this from my understanding. So we add two ioctls for
this. One is like readahead syscall, the other is like micore/fincore
syscall.
  Under a harddisk based netbook with Meego, the metadata readahead
reduced about 3.5s boot time from total 16s.

Issues:
1. it appears readahead metadata pages skipped checksum checking. I'm
still working on this.
2. in latest kernel, I got a lockdep warning. It looks not related to
the patches but I only observed it with the patches. The warning looks
like a false warning, as in my debug the spin_lock isn't hold. from my
understanding, all extent_buffer share a lockdep class and in the btree
lookup we might lock several extent_buffer. But I don't know how to fix
it yet.

Thanks,
Shaohua

[-- Attachment #2: log --]
[-- Type: text/plain, Size: 2918 bytes --]

[   88.260743] =============================================
[   88.262016] [ INFO: possible recursive locking detected ]
[   88.262669] 2.6.35-rc5-dirty #776
[   88.263298] ---------------------------------------------
[   88.263956] ra/714 is trying to acquire lock:
[   88.264515]  (&(&eb->lock)->rlock){+.+...}, at: [<ffffffffa004b9a4>] btrfs_try_spin_lock+0xa2/0x116 [btrfs]
[   88.264515]
[   88.264515] but task is already holding lock:
[   88.264515]  (&(&eb->lock)->rlock){+.+...}, at: [<ffffffffa004b8f9>] btrfs_clear_lock_blocking+0x20/0x29 [btrfs]
[   88.264515]
[   88.264515] other info that might help us debug this:
[   88.264515] 2 locks held by ra/714:
[   88.264515]  #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff81137e64>] do_lookup+0xac/0x20c
[   88.264515]  #1:  (&(&eb->lock)->rlock){+.+...}, at: [<ffffffffa004b8f9>] btrfs_clear_lock_blocking+0x20/0x29 [btrfs]
[   88.264515]
[   88.264515] stack backtrace:
[   88.264515] Pid: 714, comm: ra Not tainted 2.6.35-rc5-dirty #776
[   88.264515] Call Trace:
[   88.264515]  [<ffffffff8109afc5>] __lock_acquire+0x153f/0x15d8
[   88.264515]  [<ffffffff81097769>] ? trace_hardirqs_off_caller+0x16/0x99
[   88.264515]  [<ffffffff8106bed9>] ? release_console_sem+0x1b5/0x1e6
[   88.264515]  [<ffffffff8174b3a7>] ? sub_preempt_count+0xe/0xb7
[   88.264515]  [<ffffffff8106c4f2>] ? vprintk+0x37e/0x3c2
[   88.264515]  [<ffffffffa004b9a4>] ? btrfs_try_spin_lock+0xa2/0x116 [btrfs]
[   88.264515]  [<ffffffff8109b1a6>] lock_acquire+0x148/0x18d
[   88.264515]  [<ffffffffa004b9a4>] ? btrfs_try_spin_lock+0xa2/0x116 [btrfs]
[   88.264515]  [<ffffffff81747798>] _raw_spin_lock+0x3b/0x4a
[   88.264515]  [<ffffffffa004b9a4>] ? btrfs_try_spin_lock+0xa2/0x116 [btrfs]
[   88.264515]  [<ffffffffa004b9a4>] btrfs_try_spin_lock+0xa2/0x116 [btrfs]
[   88.264515]  [<ffffffffa000a2f9>] btrfs_search_slot+0x78d/0x921 [btrfs]
[   88.264515]  [<ffffffffa000cd73>] ? __find_space_info+0x0/0xfb [btrfs]
[   88.264515]  [<ffffffffa001a7ff>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
[   88.264515]  [<ffffffffa0028f67>] btrfs_iget+0xc3/0x418 [btrfs]
[   88.264515]  [<ffffffffa002b892>] btrfs_lookup_dentry+0x12f/0x3ff [btrfs]
[   88.264515]  [<ffffffff81141b03>] ? d_alloc+0x181/0x1d4
[   88.264515]  [<ffffffffa002bb78>] btrfs_lookup+0x16/0x2e [btrfs]
[   88.264515]  [<ffffffff81137eb4>] do_lookup+0xfc/0x20c
[   88.264515]  [<ffffffff8113960e>] do_last+0x1a1/0x5c0
[   88.264515]  [<ffffffff8113b6d6>] do_filp_open+0x1d2/0x5ed
[   88.264515]  [<ffffffff8114575f>] ? alloc_fd+0x3b/0x18e
[   88.264515]  [<ffffffff8174b43c>] ? sub_preempt_count+0xa3/0xb7
[   88.264515]  [<ffffffff81748073>] ? _raw_spin_unlock+0x35/0x52
[   88.264515]  [<ffffffff811458a0>] ? alloc_fd+0x17c/0x18e
[   88.264515]  [<ffffffff8112d113>] do_sys_open+0x63/0x116
[   88.264515]  [<ffffffff8112d1f9>] sys_open+0x20/0x22
[   88.264515]  [<ffffffff81031c1b>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-14  8:02 [patch 0/2]btrfs: add two ioctls to do metadata readahead Shaohua Li
@ 2010-07-14  8:26 ` Shaohua Li
  2010-07-19  5:43   ` Shaohua Li
  2010-07-19  8:03 ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Shaohua Li @ 2010-07-14  8:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, Arjan van de Ven, Wu, Fengguang

On Wed, Jul 14, 2010 at 04:02:16PM +0800, Shaohua Li wrote:
> Hi,
>   We have file readahead to do asyn file read, but has no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and metadata read is a sync operation, which impacts the
> efficiency of readahead much. The patches try to add meatadata readahead
> for btrfs.
>   In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> the inode to a fd so we could use existing syscalls (readahead, mincore
> or upcoming fincore) to do readahead, but the inode is hidden, there is
> no easy way for this from my understanding. So we add two ioctls for
> this. One is like readahead syscall, the other is like micore/fincore
> syscall.
>   Under a harddisk based netbook with Meego, the metadata readahead
> reduced about 3.5s boot time from total 16s.
> 
> Issues:
> 1. it appears readahead metadata pages skipped checksum checking. I'm
> still working on this.
> 2. in latest kernel, I got a lockdep warning. It looks not related to
> the patches but I only observed it with the patches. The warning looks
> like a false warning, as in my debug the spin_lock isn't hold. from my
> understanding, all extent_buffer share a lockdep class and in the btree
> lookup we might lock several extent_buffer. But I don't know how to fix
> it yet.
Ha, sorry, actually the two issues are one issue. We set lockdep level class
and doing checksum in one place. I have a debug patch for this and will
send out later. But before this, I'd like know your comments about the idea.

Thanks,
Shaohua 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-14  8:26 ` Shaohua Li
@ 2010-07-19  5:43   ` Shaohua Li
  0 siblings, 0 replies; 7+ messages in thread
From: Shaohua Li @ 2010-07-19  5:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, Arjan van de Ven, Wu, Fengguang

On Wed, Jul 14, 2010 at 04:26:45PM +0800, Shaohua Li wrote:
> On Wed, Jul 14, 2010 at 04:02:16PM +0800, Shaohua Li wrote:
> > Hi,
> >   We have file readahead to do asyn file read, but has no metadata
> > readahead. For a list of files, their metadata is stored in fragmented
> > disk space and metadata read is a sync operation, which impacts the
> > efficiency of readahead much. The patches try to add meatadata readahead
> > for btrfs.
> >   In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > the inode to a fd so we could use existing syscalls (readahead, mincore
> > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > no easy way for this from my understanding. So we add two ioctls for
> > this. One is like readahead syscall, the other is like micore/fincore
> > syscall.
> >   Under a harddisk based netbook with Meego, the metadata readahead
> > reduced about 3.5s boot time from total 16s.
> > 
> > Issues:
> > 1. it appears readahead metadata pages skipped checksum checking. I'm
> > still working on this.
> > 2. in latest kernel, I got a lockdep warning. It looks not related to
> > the patches but I only observed it with the patches. The warning looks
> > like a false warning, as in my debug the spin_lock isn't hold. from my
> > understanding, all extent_buffer share a lockdep class and in the btree
> > lookup we might lock several extent_buffer. But I don't know how to fix
> > it yet.
> Ha, sorry, actually the two issues are one issue. We set lockdep level class
> and doing checksum in one place. I have a debug patch for this and will
> send out later. But before this, I'd like know your comments about the idea.
Hi Chris,
Below is the patch to fix checksum check skip issue. It's applied after the
two ioctl patches. I did stress test here and can't find any issue. can you
please take a look at the three patches.

Thanks,
Shaohua



Subject: [patch]btrfs: do validation for extent_buffer if it's skipped before

With metadata readahead, we slightly change the behavior. Before it, we allocate
an extent_buffer (so set page->private), do metadata read and
btree_readpage_end_io_hook() will do validation. After it, we directly do
metadata readahead, and since in this case page hasn't ->private,
btree_readpage_end_io_hook() will not do validation. This patch fixes this.
It addes a new flag to indicate if a buffer is validated. If not and even
the buffer is uptodated, we will do a validation.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

---
 fs/btrfs/disk-io.c     |   67 ++++++++++++++++++++++++++++++-------------------
 fs/btrfs/disk-io.h     |    2 +
 fs/btrfs/extent-tree.c |    1 
 fs/btrfs/extent_io.c   |   14 ++++++++--
 fs/btrfs/extent_io.h   |    1 
 5 files changed, 58 insertions(+), 27 deletions(-)

Index: linux/fs/btrfs/disk-io.c
===================================================================
--- linux.orig/fs/btrfs/disk-io.c	2010-07-19 10:58:10.000000000 +0800
+++ linux/fs/btrfs/disk-io.c	2010-07-19 10:58:13.000000000 +0800
@@ -406,30 +406,18 @@ void btrfs_set_buffer_lockdep_class(stru
 }
 #endif
 
-static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
-			       struct extent_state *state)
+int btree_validate_extent_buffer(struct btrfs_root *root,
+	struct extent_buffer *eb)
 {
-	struct extent_io_tree *tree;
+	int ret = 0;
 	u64 found_start;
 	int found_level;
-	unsigned long len;
-	struct extent_buffer *eb;
-	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-	int ret = 0;
-
-	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	if (page->private == EXTENT_PAGE_PRIVATE)
-		goto out;
-	if (!page->private)
-		goto out;
-
-	len = page->private >> 2;
-	WARN_ON(len == 0);
 
-	eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+	if (!test_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags))
+		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
-	if (found_start != start) {
+	if (found_start != eb->start) {
 		if (printk_ratelimit()) {
 			printk(KERN_INFO "btrfs bad tree block start "
 			       "%llu %llu\n",
@@ -439,13 +427,7 @@ static int btree_readpage_end_io_hook(st
 		ret = -EIO;
 		goto err;
 	}
-	if (eb->first_page != page) {
-		printk(KERN_INFO "btrfs bad first page %lu %lu\n",
-		       eb->first_page->index, page->index);
-		WARN_ON(1);
-		ret = -EIO;
-		goto err;
-	}
+
 	if (check_tree_block_fsid(root, eb)) {
 		if (printk_ratelimit()) {
 			printk(KERN_INFO "btrfs bad fsid on block %llu\n",
@@ -461,6 +443,41 @@ static int btree_readpage_end_io_hook(st
 	ret = csum_tree_block(root, eb, 1);
 	if (ret)
 		ret = -EIO;
+err:
+	if (ret == 0)
+		clear_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags);
+	return ret;
+}
+
+static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
+			       struct extent_state *state)
+{
+	struct extent_io_tree *tree;
+	unsigned long len;
+	struct extent_buffer *eb;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	int ret = 0;
+
+	tree = &BTRFS_I(page->mapping->host)->io_tree;
+	if (page->private == EXTENT_PAGE_PRIVATE)
+		goto out;
+	if (!page->private)
+		goto out;
+
+	len = page->private >> 2;
+	WARN_ON(len == 0);
+
+	eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+
+	if (eb->first_page != page) {
+		printk(KERN_INFO "btrfs bad first page %lu %lu\n",
+		       eb->first_page->index, page->index);
+		WARN_ON(1);
+		ret = -EIO;
+		goto err;
+	}
+
+	ret = btree_validate_extent_buffer(root, eb);
 
 	end = min_t(u64, eb->len, PAGE_CACHE_SIZE);
 	end = eb->start + end - 1;
Index: linux/fs/btrfs/disk-io.h
===================================================================
--- linux.orig/fs/btrfs/disk-io.h	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/disk-io.h	2010-07-19 10:58:13.000000000 +0800
@@ -36,6 +36,8 @@ static inline u64 btrfs_sb_offset(int mi
 struct btrfs_device;
 struct btrfs_fs_devices;
 
+int btree_validate_extent_buffer(struct btrfs_root *root,
+	struct extent_buffer *eb);
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
 				      u32 blocksize, u64 parent_transid);
 int readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
Index: linux/fs/btrfs/extent_io.c
===================================================================
--- linux.orig/fs/btrfs/extent_io.c	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent_io.c	2010-07-19 10:58:13.000000000 +0800
@@ -14,6 +14,7 @@
 #include "extent_map.h"
 #include "compat.h"
 #include "ctree.h"
+#include "disk-io.h"
 #include "btrfs_inode.h"
 
 static struct kmem_cache *extent_state_cache;
@@ -3162,6 +3163,7 @@ struct extent_buffer *alloc_extent_buffe
 			uptodate = 0;
 		unlock_page(p);
 	}
+	set_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags);
 	if (uptodate)
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 
@@ -3387,9 +3389,10 @@ int read_extent_buffer_pages(struct exte
 	unsigned long num_pages;
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
+	struct btrfs_root *root = BTRFS_I(tree->mapping->host)->root;
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
-		return 0;
+		goto out;
 
 	if (test_range_bit(tree, eb->start, eb->start + eb->len - 1,
 			   EXTENT_UPTODATE, 1, NULL)) {
@@ -3454,8 +3457,10 @@ int read_extent_buffer_pages(struct exte
 			ret = -EIO;
 	}
 
-	if (!ret)
+	if (!ret) {
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+		goto out;
+	}
 	return ret;
 
 unlock_exit:
@@ -3466,6 +3471,11 @@ unlock_exit:
 		unlock_page(page);
 		locked_pages--;
 	}
+out:
+	if (!ret && test_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags) &&
+		test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+		return btree_validate_extent_buffer(root, eb);
+
 	return ret;
 }
 
Index: linux/fs/btrfs/extent_io.h
===================================================================
--- linux.orig/fs/btrfs/extent_io.h	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent_io.h	2010-07-19 10:58:13.000000000 +0800
@@ -27,6 +27,7 @@
 #define EXTENT_BUFFER_UPTODATE 0
 #define EXTENT_BUFFER_BLOCKING 1
 #define EXTENT_BUFFER_DIRTY 2
+#define EXTENT_BUFFER_UNCHECKED 3
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define EXTENT_CLEAR_UNLOCK_PAGE 0x1
Index: linux/fs/btrfs/extent-tree.c
===================================================================
--- linux.orig/fs/btrfs/extent-tree.c	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent-tree.c	2010-07-19 10:58:13.000000000 +0800
@@ -5277,6 +5277,7 @@ struct extent_buffer *btrfs_init_new_buf
 	clean_tree_block(trans, root, buf);
 
 	btrfs_set_lock_blocking(buf);
+	clear_bit(EXTENT_BUFFER_UNCHECKED, &buf->bflags);
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-14  8:02 [patch 0/2]btrfs: add two ioctls to do metadata readahead Shaohua Li
  2010-07-14  8:26 ` Shaohua Li
@ 2010-07-19  8:03 ` Christoph Hellwig
  2010-07-19 11:25   ` Chris Mason
  2010-07-19 14:00   ` Arjan van de Ven
  1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2010-07-19  8:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Chris Mason, linux-btrfs, Arjan van de Ven, Wu, Fengguang

This seems like iteration 66 of the ill fated mobling readahead crap.

Please go to linux-fsdevel to define a proper interface instead of
your per-filesystem hacks.  I had hoped Arjan got it after the last big
flameware, but it seems like you're only moving from one target to the
next.

NACK in this current form.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-19  8:03 ` Christoph Hellwig
@ 2010-07-19 11:25   ` Chris Mason
  2010-07-19 14:00   ` Arjan van de Ven
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2010-07-19 11:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shaohua Li, linux-btrfs, Arjan van de Ven, Wu, Fengguang

On Mon, Jul 19, 2010 at 04:03:54AM -0400, Christoph Hellwig wrote:
> This seems like iteration 66 of the ill fated mobling readahead crap.
> 
> Please go to linux-fsdevel to define a proper interface instead of
> your per-filesystem hacks.  I had hoped Arjan got it after the last big
> flameware, but it seems like you're only moving from one target to the
> next.
> 
> NACK in this current form.

Honestly I think its good to have some of these features in specific
filesystems at first.  Sure, it is easy to tell if they help in the lab
and during development, but why go through the pain of implementing them
in generic form until they have been through a broader test?

So the idea for the feature is fine with me, especially for something
like readahead which we can support forever by turning it into a noop if
there turns out to be a better idea.

-chris


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-19  8:03 ` Christoph Hellwig
  2010-07-19 11:25   ` Chris Mason
@ 2010-07-19 14:00   ` Arjan van de Ven
  2010-07-20  3:58     ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Arjan van de Ven @ 2010-07-19 14:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Shaohua Li, Chris Mason, linux-btrfs, Wu, Fengguang

On 7/19/2010 1:03 AM, Christoph Hellwig wrote:
> This seems like iteration 66 of the ill fated mobling readahead crap.
>    

I think you need to actually look at what this patch does before 
rendering this judgment; something you apparently have not done.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead
  2010-07-19 14:00   ` Arjan van de Ven
@ 2010-07-20  3:58     ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2010-07-20  3:58 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Christoph Hellwig, Shaohua Li, Chris Mason, linux-btrfs, Wu, Fengguang

On Mon, Jul 19, 2010 at 07:00:33AM -0700, Arjan van de Ven wrote:
> On 7/19/2010 1:03 AM, Christoph Hellwig wrote:
> >This seems like iteration 66 of the ill fated mobling readahead crap.
> 
> I think you need to actually look at what this patch does before
> rendering this judgment; something you apparently have not done.

Thanks a lot for the attack, but of course I have.  It's the same stupid
idea of throwing almost generic code into filesystems that you previous
attempt did.  It's completely generic code for checking what pages are
in-memory (plus constaints) and then force read-ahead on them again.
The only filesystems specific bit is which address_space to select.  The
whole apporach is also useful to extN / xfs / etc if you allow it on
any address_space.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-07-20  3:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-14  8:02 [patch 0/2]btrfs: add two ioctls to do metadata readahead Shaohua Li
2010-07-14  8:26 ` Shaohua Li
2010-07-19  5:43   ` Shaohua Li
2010-07-19  8:03 ` Christoph Hellwig
2010-07-19 11:25   ` Chris Mason
2010-07-19 14:00   ` Arjan van de Ven
2010-07-20  3:58     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).