linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 0/6] btrfs: scrub
@ 2011-03-11 14:49 Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 1/6] btrfs: add parameter to btrfs_lookup_csum_range Arne Jansen
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

This series adds an initial implementation for scrub. The approach is
quite straightforward: user mode issues an ioctl for each device in
the fs. For each device, the allocated device chunks are enumerated.
For each chunk, the contained extents are enumerated and the data
checksums fetched. The extents are read sequentially and the checksums
verified. If an error occurs (checksum mismatch or EIO), a good copy
is searched for; if one is found, the bad copy is rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs are paused and afterwards continue from the new
roots.
For future improvements, please see the inline comments.
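
A rough sketch of the per-device flow described above (illustrative
only; the enumeration helpers below are made-up placeholder names, the
real code is in scrub.c, patch 3):

	/* illustrative sketch; helper names are placeholders */
	int scrub_one_device(struct btrfs_device *dev)
	{
		for_each_allocated_dev_chunk(dev, chunk) {
			for_each_extent_in_chunk(chunk, extent) {
				u8 csum[BTRFS_CSUM_SIZE];
				int have_csum = lookup_csum(extent, csum);

				if (!read_and_verify(dev, extent,
						     have_csum ? csum : NULL))
					continue;	/* extent is fine */
				if (find_good_copy(extent))
					rewrite_bad_copy(dev, extent);
				else
					report_uncorrectable(extent);
			}
		}
		return 0;
	}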

The accompanying user mode patches will follow shortly.

This v2 mainly changes the dev_info ioctl interface.

Thanks,
Arne


Arne Jansen (5):
  btrfs: add parameter to btrfs_lookup_csum_range
  btrfs: make struct map_lookup public
  btrfs: add scrub code and prototypes
  btrfs: sync scrub with commit & device removal
  btrfs: add state information for scrub

Jan Schmidt (1):
  btrfs: new ioctls for scrub

 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/ctree.h       |   46 ++-
 fs/btrfs/disk-io.c     |   16 +
 fs/btrfs/file-item.c   |    8 +-
 fs/btrfs/inode.c       |    2 +-
 fs/btrfs/ioctl.c       |  131 +++++
 fs/btrfs/ioctl.h       |   55 ++
 fs/btrfs/relocation.c  |    2 +-
 fs/btrfs/scrub.c       | 1463 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/transaction.c |    3 +
 fs/btrfs/tree-log.c    |    6 +-
 fs/btrfs/volumes.c     |   16 +-
 fs/btrfs/volumes.h     |   17 +
 13 files changed, 1743 insertions(+), 24 deletions(-)
 create mode 100644 fs/btrfs/scrub.c

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/6] btrfs: add parameter to btrfs_lookup_csum_range
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 2/6] btrfs: make struct map_lookup public Arne Jansen
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

A parameter is added to search the commit root instead of the live root.
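
Scrub (patch 3) uses the new flag to fetch the checksums for a stripe
from the commit root without taking tree locks, e.g.:

	ret = btrfs_lookup_csums_range(csum_root, logical,
	                               logical + map->stripe_len - 1,
	                               &sdev->csum_list, 1);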

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/ctree.h      |    4 ++--
 fs/btrfs/file-item.c  |    8 +++++++-
 fs/btrfs/inode.c      |    2 +-
 fs/btrfs/relocation.c |    2 +-
 fs/btrfs/tree-log.c   |    4 ++--
 5 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 28188a7..4c99834 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2450,8 +2450,8 @@ struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
 int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root, struct btrfs_path *path,
 			u64 isize);
-int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start,
-			     u64 end, struct list_head *list);
+int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
+			     struct list_head *list, int search_commit);
 /* inode.c */
 
 /* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 4f19a3e..9643d6e 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -263,7 +263,7 @@ int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
-			     struct list_head *list)
+			     struct list_head *list, int search_commit)
 {
 	struct btrfs_key key;
 	struct btrfs_path *path;
@@ -280,6 +280,12 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	path = btrfs_alloc_path();
 	BUG_ON(!path);
 
+	if (search_commit) {
+		path->skip_locking = 1;
+		path->reada = 2;
+		path->search_commit_root = 1;
+	}
+
 	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
 	key.offset = start;
 	key.type = BTRFS_EXTENT_CSUM_KEY;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 44b9266..8fdb5f6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1001,7 +1001,7 @@ static noinline int csum_exist_in_range(struct btrfs_root *root,
 	LIST_HEAD(list);
 
 	ret = btrfs_lookup_csums_range(root->fs_info->csum_root, bytenr,
-				       bytenr + num_bytes - 1, &list);
+				       bytenr + num_bytes - 1, &list, 0);
 	if (ret == 0 && list_empty(&list))
 		return 0;
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 31ade58..d7ae412 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4236,7 +4236,7 @@ int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len)
 
 	disk_bytenr = file_pos + BTRFS_I(inode)->index_cnt;
 	ret = btrfs_lookup_csums_range(root->fs_info->csum_root, disk_bytenr,
-				       disk_bytenr + len - 1, &list);
+				       disk_bytenr + len - 1, &list, 0);
 
 	while (!list_empty(&list)) {
 		sums = list_entry(list.next, struct btrfs_ordered_sum, list);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a4bbb85..1f6788f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -614,7 +614,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 
 			ret = btrfs_lookup_csums_range(root->log_root,
 						csum_start, csum_end - 1,
-						&ordered_sums);
+						&ordered_sums, 0);
 			BUG_ON(ret);
 			while (!list_empty(&ordered_sums)) {
 				struct btrfs_ordered_sum *sums;
@@ -2691,7 +2691,7 @@ static noinline int copy_items(struct btrfs_trans_handle *trans,
 				ret = btrfs_lookup_csums_range(
 						log->fs_info->csum_root,
 						ds + cs, ds + cs + cl - 1,
-						&ordered_sums);
+						&ordered_sums, 0);
 				BUG_ON(ret);
 			}
 		}
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/6] btrfs: make struct map_lookup public
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 1/6] btrfs: add parameter to btrfs_lookup_csum_range Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 3/6] btrfs: add scrub code and prototypes Arne Jansen
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

The definition of struct map_lookup is moved from volumes.c to the header.
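
This is needed because scrub (patch 3) looks up the chunk mapping and
walks the stripes of a chunk directly, roughly like this (condensed
from scrub_chunk in scrub.c):

	map = (struct map_lookup *)em->bdev;
	for (i = 0; i < map->num_stripes; ++i) {
		if (map->stripes[i].dev == sdev->dev)
			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
	}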

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/volumes.c |   14 --------------
 fs/btrfs/volumes.h |   14 ++++++++++++++
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 94334d9..7dc9fa5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -33,25 +33,11 @@
 #include "volumes.h"
 #include "async-thread.h"
 
-struct map_lookup {
-	u64 type;
-	int io_align;
-	int io_width;
-	int stripe_len;
-	int sector_size;
-	int num_stripes;
-	int sub_stripes;
-	struct btrfs_bio_stripe stripes[];
-};
-
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				struct btrfs_device *device);
 static int btrfs_relocate_sys_chunks(struct btrfs_root *root);
 
-#define map_lookup_size(n) (sizeof(struct map_lookup) + \
-			    (sizeof(struct btrfs_bio_stripe) * (n)))
-
 static DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7af6144..0ccc982 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -129,6 +129,20 @@ struct btrfs_bio_stripe {
 	u64 physical;
 };
 
+struct map_lookup {
+	u64 type;
+	int io_align;
+	int io_width;
+	int stripe_len;
+	int sector_size;
+	int num_stripes;
+	int sub_stripes;
+	struct btrfs_bio_stripe stripes[];
+};
+
+#define map_lookup_size(n) (sizeof(struct map_lookup) + \
+			    (sizeof(struct btrfs_bio_stripe) * (n)))
+
 struct btrfs_multi_bio {
 	atomic_t stripes_pending;
 	bio_end_io_t *end_io;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/6] btrfs: add scrub code and prototypes
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 1/6] btrfs: add parameter to btrfs_lookup_csum_range Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 2/6] btrfs: make struct map_lookup public Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 16:34   ` David Sterba
  2011-03-11 14:49 ` [PATCH v2 4/6] btrfs: sync scrub with commit & device removal Arne Jansen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

This is the main scrub code.

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/Makefile |    2 +-
 fs/btrfs/ctree.h  |   14 +
 fs/btrfs/scrub.c  | 1463 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1478 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 31610ea..8fda313 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
-	   compression.o delayed-ref.o relocation.o
+	   compression.o delayed-ref.o relocation.o scrub.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c99834..030c321 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2610,4 +2610,18 @@ void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
 			      u64 *bytes_to_reserve);
 void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 			      struct btrfs_pending_snapshot *pending);
+
+/* scrub.c */
+int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
+                    struct btrfs_scrub_progress *progress);
+int btrfs_scrub_pause(struct btrfs_root *root);
+int btrfs_scrub_pause_super(struct btrfs_root *root);
+int btrfs_scrub_continue(struct btrfs_root *root);
+int btrfs_scrub_continue_super(struct btrfs_root *root);
+int btrfs_scrub_cancel(struct btrfs_root *root);
+int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
+int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
+int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
+                         struct btrfs_scrub_progress *progress);
+
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
new file mode 100644
index 0000000..d606f4d
--- /dev/null
+++ b/fs/btrfs/scrub.c
@@ -0,0 +1,1463 @@
+/*
+ * Copyright (C) 2011 STRATO.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "ordered-data.h"
+
+/*
+ * This is only the first step towards a full-featured scrub. It reads all
+ * extents and super blocks and verifies the checksums. In case a bad checksum
+ * is found or the extent cannot be read, good data will be written back if
+ * any can be found.
+ *
+ * Future enhancements:
+ *  - To enhance the performance, better read-ahead strategies for the
+ *    extent-tree can be employed.
+ *  - In case an unrepairable extent is encountered, track which files are
+ *    affected and report them
+ *  - In case of a read error on files with nodatasum, map the file and read
+ *    the extent to trigger a writeback of the good copy
+ *  - track and record media errors, throw out bad devices
+ *  - add a readonly mode
+ *  - add a mode to also read unallocated space
+ */
+
+#ifdef SCRUB_BTRFS_WORKER
+typedef struct btrfs_work scrub_work_t;
+#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
+#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker(&(wq), w); } while (0)
+#else
+typedef struct work_struct scrub_work_t;
+#define SCRUB_INIT_WORK INIT_WORK
+#define SCRUB_QUEUE_WORK queue_work
+#endif
+
+struct scrub_bio;
+struct scrub_page;
+struct scrub_dev;
+struct scrub_fixup;
+static void scrub_bio_end_io(struct bio *bio, int err);
+static void scrub_checksum(scrub_work_t *work);
+static int scrub_checksum_data(struct scrub_dev *sdev,
+                               struct scrub_page *spag, void *buffer);
+static int scrub_checksum_tree_block(struct scrub_dev *sdev,
+                                     struct scrub_page *spag, u64 logical,
+                                     void *buffer);
+static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
+static void scrub_recheck_end_io(struct bio *bio, int err);
+static void scrub_fixup_worker(scrub_work_t *work);
+static void scrub_fixup(struct scrub_fixup *fixup);
+
+#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
+#define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
+
+struct scrub_page {
+	u64			flags;  /* extent flags */
+	u64			generation;
+	u64			mirror_num;
+	int			have_csum;
+	u8			csum[BTRFS_CSUM_SIZE];
+};
+
+struct scrub_bio {
+	int			index;
+	struct scrub_dev	*sdev;
+	struct bio		*bio;
+	int			err;
+	u64			logical;
+	u64			physical;
+	struct scrub_page	spag[SCRUB_PAGES_PER_BIO];
+	u64			count;
+	int			next_free;
+	scrub_work_t		work;
+};
+
+struct scrub_dev {
+	struct scrub_bio	bios[SCRUB_BIOS_PER_DEV];
+	struct btrfs_device	*dev;
+	int			first_free;
+	int			curr;
+	atomic_t		in_flight;
+	spinlock_t		list_lock;
+	wait_queue_head_t	list_wait;
+	u16			csum_size;
+	struct list_head	csum_list;
+	atomic_t		cancel_req;
+	/*
+	 * statistics
+	 */
+	struct btrfs_scrub_progress stat;
+	spinlock_t		stat_lock;
+};
+
+struct scrub_fixup {
+	struct scrub_dev	*sdev;
+	struct bio		*bio;
+	u64			logical;
+	u64			physical;
+	struct scrub_page	spag;
+	scrub_work_t		work;
+	int			err;
+	int			recheck;
+};
+
+static void scrub_free_csums(struct scrub_dev *sdev)
+{
+	while(!list_empty(&sdev->csum_list)) {
+		struct btrfs_ordered_sum *sum;
+		sum = list_first_entry(&sdev->csum_list,
+		                       struct btrfs_ordered_sum, list);
+		list_del(&sum->list);
+		kfree(sum);
+	}
+}
+
+static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
+{
+	int i;
+	int j;
+	struct page *last_page;
+
+	if (!sdev)
+		return;
+
+	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
+		struct bio *bio = sdev->bios[i].bio;
+		if (!bio)
+			break;
+
+		last_page = NULL;
+		for (j = 0; j < bio->bi_vcnt; ++j) {
+			if (bio->bi_io_vec[j].bv_page == last_page)
+				continue;
+			last_page = bio->bi_io_vec[j].bv_page;
+			__free_page(last_page);
+		}
+		bio_put(sdev->bios[i].bio);
+	}
+
+	scrub_free_csums(sdev);
+	kfree(sdev);
+}
+
+static noinline_for_stack
+struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
+{
+	struct scrub_dev *sdev;
+	int		i;
+	int		j;
+	int		ret;
+	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
+	sdev = kzalloc(sizeof(*sdev), GFP_NOFS);
+	if (!sdev)
+		goto nomem;
+	sdev->dev = dev;
+	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
+		struct bio *bio;
+
+		bio = bio_alloc(GFP_NOFS, SCRUB_PAGES_PER_BIO);
+		if (!bio)
+			goto nomem;
+
+		sdev->bios[i].index = i;
+		sdev->bios[i].sdev = sdev;
+		sdev->bios[i].bio = bio;
+		sdev->bios[i].count = 0;
+		SCRUB_INIT_WORK(&sdev->bios[i].work, scrub_checksum);
+		bio->bi_private = sdev->bios + i;
+		bio->bi_end_io = scrub_bio_end_io;
+		bio->bi_sector = 0;
+		bio->bi_bdev = dev->bdev;
+		bio->bi_size = 0;
+
+		for (j = 0; j < SCRUB_PAGES_PER_BIO; ++j) {
+			struct page *page;
+			page = alloc_page(GFP_NOFS);
+			if (!page)
+				goto nomem;
+
+			ret = bio_add_page(bio, page, PAGE_SIZE, 0);
+			if (!ret)
+				goto nomem;
+		}
+		WARN_ON(bio->bi_vcnt != SCRUB_PAGES_PER_BIO);
+
+		if (i != SCRUB_BIOS_PER_DEV-1)
+			sdev->bios[i].next_free = i + 1;
+		 else
+			sdev->bios[i].next_free = -1;
+	}
+	sdev->first_free = 0;
+	sdev->curr = -1;
+	atomic_set(&sdev->in_flight, 0);
+	atomic_set(&sdev->cancel_req, 0);
+	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
+	INIT_LIST_HEAD(&sdev->csum_list);
+	
+	spin_lock_init(&sdev->list_lock);
+	spin_lock_init(&sdev->stat_lock);
+	init_waitqueue_head(&sdev->list_wait);
+	return sdev;
+
+nomem:
+	scrub_free_dev(sdev);
+	return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * scrub_recheck_error gets called when either verification of the page
+ * failed or the bio failed to read, e.g. with EIO. In the latter case,
+ * recheck_error gets called for every page in the bio, even though only
+ * one may be bad
+ */
+static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
+{
+	struct scrub_dev *sdev = sbio->sdev;
+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+	struct bio *bio = NULL;
+	struct page *page = NULL;
+	struct scrub_fixup *fixup = NULL;
+	int ret;
+
+	/*
+	 * while we're in here we do not want the transaction to commit.
+	 * To prevent it, we increment scrubs_running. scrub_pause will
+	 * have to wait until we're finished
+	 */
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_inc(&fs_info->scrubs_running);
+	mutex_unlock(&fs_info->scrub_lock);
+
+	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
+	if (!fixup)
+		goto malloc_error;
+
+	fixup->logical = sbio->logical + ix * PAGE_SIZE;
+	fixup->physical = sbio->physical + ix * PAGE_SIZE;
+	fixup->spag = sbio->spag[ix];
+	fixup->sdev = sdev;
+
+	bio = bio_alloc(GFP_NOFS, 1);
+	if (!bio)
+		goto malloc_error;
+	bio->bi_private = fixup;
+	bio->bi_size = 0;
+	bio->bi_bdev = sdev->dev->bdev;	/* FIXME: temporary for add_page */
+	fixup->bio = bio;
+	fixup->recheck = 0;
+
+	page = alloc_page(GFP_NOFS);
+	if (!page)
+		goto malloc_error;
+
+	ret = bio_add_page(bio, page, PAGE_SIZE, 0);
+	if (!ret)
+		goto malloc_error;
+
+	if (!sbio->err) {
+		/*
+		 * shorter path: just a checksum error, go ahead and correct it
+		 */
+		scrub_fixup_worker(&fixup->work);
+		return;
+	}
+
+	/*
+	 * an I/O error occurred for one of the blocks in the bio, not
+	 * necessarily for this one, so first try to read it separately
+	 */
+	SCRUB_INIT_WORK(&fixup->work, scrub_fixup_worker);
+	fixup->recheck = 1;
+	bio->bi_end_io = scrub_recheck_end_io;
+	bio->bi_sector = fixup->physical >> 9;
+	bio->bi_bdev = sdev->dev->bdev;
+	submit_bio(0, bio);
+
+	return;
+
+malloc_error:
+	if (bio) 
+		bio_put(bio);
+	if (page)
+		__free_page(page);
+	if (fixup)
+		kfree(fixup);
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.malloc_errors;
+	spin_unlock(&sdev->stat_lock);
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	mutex_unlock(&fs_info->scrub_lock);
+	wake_up(&fs_info->scrub_pause_wait);
+}
+
+static void scrub_recheck_end_io(struct bio *bio, int err)
+{
+	struct scrub_fixup *fixup = bio->bi_private;
+	struct btrfs_fs_info *fs_info = fixup->sdev->dev->dev_root->fs_info;
+
+	fixup->err = err;
+	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &fixup->work);
+}
+
+static int scrub_fixup_check(struct scrub_fixup *fixup)
+{
+	int ret = 1;
+	struct page *page;
+	void *buffer;
+	u64 flags = fixup->spag.flags;
+
+	page = fixup->bio->bi_io_vec[0].bv_page;
+	buffer = kmap_atomic(page, KM_USER0);
+	if (flags & BTRFS_EXTENT_FLAG_DATA) {
+		ret = scrub_checksum_data(fixup->sdev,
+					  &fixup->spag, buffer);
+	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		ret = scrub_checksum_tree_block(fixup->sdev,
+						&fixup->spag,
+						fixup->logical,
+						buffer);
+	} else {
+		WARN_ON(1);
+	}
+	kunmap_atomic(buffer, KM_USER0);
+
+	return ret;
+}
+
+static void scrub_fixup_worker(scrub_work_t *work)
+{
+	struct scrub_fixup *fixup;
+	struct btrfs_fs_info *fs_info;
+	u64 flags;
+	int ret = 1;
+
+	fixup = container_of(work, struct scrub_fixup, work);
+	fs_info = fixup->sdev->dev->dev_root->fs_info;
+	flags = fixup->spag.flags;
+
+	if (fixup->recheck && fixup->err == 0)
+		ret = scrub_fixup_check(fixup);
+
+	if (ret || fixup->err)
+		scrub_fixup(fixup);
+
+	__free_page(fixup->bio->bi_io_vec[0].bv_page);
+	bio_put(fixup->bio);
+
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	mutex_unlock(&fs_info->scrub_lock);
+	wake_up(&fs_info->scrub_pause_wait);
+
+	kfree(fixup);
+}
+
+static void scrub_fixup_end_io(struct bio *bio, int err)
+{
+	complete((struct completion *)bio->bi_private);
+}
+
+static void scrub_fixup(struct scrub_fixup *fixup)
+{
+	struct scrub_dev *sdev = fixup->sdev;
+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree; 
+	struct btrfs_multi_bio *multi = NULL;
+	struct bio *bio = fixup->bio;
+	u64 length;
+	int i;
+	int ret;
+	DECLARE_COMPLETION_ONSTACK(complete);
+
+	if ((fixup->spag.flags & BTRFS_EXTENT_FLAG_DATA) &&
+	    (fixup->spag.have_csum == 0)) {
+		/*
+		 * nodatasum, don't try to fix anything
+		 * FIXME: we can do better, open the inode and trigger a
+		 * writeback
+		 */
+		goto uncorrectable;
+	}
+
+	length = PAGE_SIZE;
+	ret = btrfs_map_block(map_tree, REQ_WRITE, fixup->logical, &length,
+	                      &multi, 0);
+	if (ret || !multi || length < PAGE_SIZE) {
+		printk(KERN_ERR
+		       "scrub_fixup: btrfs_map_block failed us for %lld\n",
+		       fixup->logical);
+		WARN_ON(1);
+		return;
+	}
+
+	if (multi->num_stripes == 1) {
+		/* there aren't any replicas */
+		goto uncorrectable;
+	}
+
+	/*
+	 * first find a good copy
+	 */
+	for (i = 0; i < multi->num_stripes; ++i) {
+		if (i == fixup->spag.mirror_num)
+			continue;
+
+		bio->bi_sector = multi->stripes[i].physical >> 9;
+		bio->bi_bdev = multi->stripes[i].dev->bdev;
+		bio->bi_size = PAGE_SIZE;
+		bio->bi_next = NULL;
+		bio->bi_flags = 1 << BIO_UPTODATE;
+		bio->bi_comp_cpu = -1;
+		bio->bi_end_io = scrub_fixup_end_io;
+		bio->bi_private = &complete;
+
+		submit_bio(0, bio);
+
+		wait_for_completion(&complete);
+
+		if (~bio->bi_flags & BIO_UPTODATE)
+			/* I/O-error, this is not a good copy */
+			continue;
+
+		ret = scrub_fixup_check(fixup);
+		if (ret == 0)
+			break;
+	}
+	if (i == multi->num_stripes)
+		goto uncorrectable;
+
+	/*
+	 * the bio now contains good data, write it back
+	 */
+	bio->bi_sector = fixup->physical >> 9;
+	bio->bi_bdev = sdev->dev->bdev;
+	bio->bi_size = PAGE_SIZE;
+	bio->bi_next = NULL;
+	bio->bi_flags = 1 << BIO_UPTODATE;
+	bio->bi_comp_cpu = -1;
+	bio->bi_end_io = scrub_fixup_end_io;
+	bio->bi_private = &complete;
+
+	submit_bio(REQ_WRITE, bio);
+
+	wait_for_completion(&complete);
+
+	if (~bio->bi_flags & BIO_UPTODATE)
+		/* I/O-error, writeback failed, give up */
+		goto uncorrectable;
+
+	kfree(multi);
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.corrected_errors;
+	spin_unlock(&sdev->stat_lock);
+
+	if (printk_ratelimit())
+		printk(KERN_ERR "btrfs: fixed up at %lld\n", fixup->logical);
+	return;
+
+uncorrectable:
+	kfree(multi);
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.uncorrectable_errors;
+	spin_unlock(&sdev->stat_lock);
+
+	if (printk_ratelimit())
+		printk(KERN_ERR "btrfs: unable to fixup at %lld\n",
+			 fixup->logical);
+}
+
+static void scrub_bio_end_io(struct bio *bio, int err)
+{
+	struct scrub_bio *sbio = bio->bi_private;
+	struct scrub_dev *sdev = sbio->sdev;
+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+
+	sbio->err = err;
+
+	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &sbio->work);
+}
+
+static void scrub_checksum(scrub_work_t *work)
+{
+	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
+	struct scrub_dev *sdev = sbio->sdev;
+	struct page *page;
+	void *buffer;
+	int i;
+	u64 flags;
+	u64 logical;
+	int ret;
+
+	if (sbio->err) {
+		for (i = 0; i < sbio->count; ++i) {
+			scrub_recheck_error(sbio, i);
+		}
+		spin_lock(&sdev->stat_lock);
+		++sdev->stat.read_errors;
+		spin_unlock(&sdev->stat_lock);
+		goto out;
+	}
+	for (i = 0; i < sbio->count; ++i) {
+		page = sbio->bio->bi_io_vec[i].bv_page;
+		buffer = kmap_atomic(page, KM_USER0);
+		flags = sbio->spag[i].flags;
+		logical = sbio->logical + i * PAGE_SIZE;
+		ret = 0;
+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
+			ret = scrub_checksum_data(sdev, sbio->spag + i, buffer);
+		} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+			ret = scrub_checksum_tree_block(sdev, sbio->spag + i,
+			                                logical, buffer);
+		} else if (flags & BTRFS_EXTENT_FLAG_SUPER) {
+			BUG_ON(i);
+			(void)scrub_checksum_super(sbio, buffer);
+		} else {
+			WARN_ON(1);
+		}
+		kunmap_atomic(buffer, KM_USER0);
+		if (ret)
+			scrub_recheck_error(sbio, i);
+	}
+
+out:
+	spin_lock(&sdev->list_lock);
+	sbio->next_free = sdev->first_free;
+	sdev->first_free = sbio->index;
+	spin_unlock(&sdev->list_lock);
+	atomic_dec(&sdev->in_flight);
+	wake_up(&sdev->list_wait);
+}
+
+static int scrub_checksum_data(struct scrub_dev *sdev,
+                               struct scrub_page *spag, void *buffer)
+{
+	u8 csum[BTRFS_CSUM_SIZE];
+	u32 crc = ~(u32)0;
+	int fail = 0;
+	struct btrfs_root *root = sdev->dev->dev_root;
+
+	if (!spag->have_csum)
+		return 0;
+
+	crc = btrfs_csum_data(root, buffer, crc, PAGE_SIZE);
+	btrfs_csum_final(crc, csum);
+	if (memcmp(csum, spag->csum, sdev->csum_size))
+		fail = 1;
+
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.data_extents_scrubbed;
+	sdev->stat.data_bytes_scrubbed += PAGE_SIZE;
+	if (fail)
+		++sdev->stat.csum_errors;
+	spin_unlock(&sdev->stat_lock);
+
+	return fail;
+}
+
+static int scrub_checksum_tree_block(struct scrub_dev *sdev,
+                                     struct scrub_page *spag, u64 logical,
+                                     void *buffer)
+{
+	struct btrfs_header *h;
+	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u32 crc = ~(u32)0;
+	int fail = 0;
+	int crc_fail = 0;
+
+	/*
+	 * we don't use the getter functions here, as we
+	 * a) don't have an extent buffer and
+	 * b) the page is already kmapped
+	 */
+	h = (struct btrfs_header *)buffer;
+
+	if (logical != le64_to_cpu(h->bytenr))
+		++fail;
+
+	if (spag->generation != le64_to_cpu(h->generation))
+		++fail;
+
+	if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
+		++fail;
+
+	if (memcmp(h->chunk_tree_uuid, fs_info->chunk_tree_uuid,
+	           BTRFS_UUID_SIZE))
+		++fail;
+
+	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
+	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
+	btrfs_csum_final(crc, csum);
+	if (memcmp(csum, h->csum, sdev->csum_size))
+		++crc_fail;
+
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.tree_extents_scrubbed;
+	sdev->stat.tree_bytes_scrubbed += PAGE_SIZE;
+	if (crc_fail)
+		++sdev->stat.csum_errors;
+	if (fail)
+		++sdev->stat.verify_errors;
+	spin_unlock(&sdev->stat_lock);
+
+	return (fail || crc_fail);
+}
+
+static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer)
+{
+	struct btrfs_super_block *s;
+	u64 logical;
+	struct scrub_dev *sdev = sbio->sdev;
+	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u32 crc = ~(u32)0;
+	int fail = 0;
+
+	s = (struct btrfs_super_block *)buffer;
+	logical = sbio->logical;
+
+	if (logical != le64_to_cpu(s->bytenr))
+		++fail;
+
+	if (sbio->spag[0].generation != le64_to_cpu(s->generation))
+		++fail;
+
+	if (memcmp(s->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
+		++fail;
+
+	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
+	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
+	btrfs_csum_final(crc, csum);
+	if (memcmp(csum, s->csum, sbio->sdev->csum_size))
+		++fail;
+
+	if (fail) {
+		/*
+		 * if we find an error in a super block, we just report it.
+		 * The super blocks get rewritten with the next transaction
+		 * commit anyway
+		 */
+		spin_lock(&sdev->stat_lock);
+		++sdev->stat.super_errors;
+		spin_unlock(&sdev->stat_lock);
+	}
+
+	return fail;
+}
+
+static int scrub_submit(struct scrub_dev *sdev)
+{
+	struct scrub_bio *sbio;
+
+	if (sdev->curr == -1)
+		return 0;
+
+	sbio = sdev->bios + sdev->curr;
+	
+	sbio->bio->bi_sector = sbio->physical >> 9;
+	sbio->bio->bi_size = sbio->count * PAGE_SIZE;
+	sbio->bio->bi_next = NULL;
+	sbio->bio->bi_flags = 1 << BIO_UPTODATE;
+	sbio->bio->bi_comp_cpu = -1;
+	sbio->bio->bi_bdev = sdev->dev->bdev;
+	sdev->curr = -1;
+	atomic_inc(&sdev->in_flight);
+
+	submit_bio(0, sbio->bio);
+
+	return 0;
+}
+
+static int scrub_page(struct scrub_dev *sdev, u64 logical, u64 len,
+                      u64 physical, u64 flags, u64 gen, u64 mirror_num,
+                      u8 *csum, int force)
+{
+	struct scrub_bio *sbio;
+again:
+	/*
+	 * grab a fresh bio or wait for one to become available
+	 */
+	while (sdev->curr == -1) {
+		unsigned long flags;
+		spin_lock_irqsave(&sdev->list_lock, flags);
+		sdev->curr = sdev->first_free;
+		if (sdev->curr != -1) {
+			sdev->first_free = sdev->bios[sdev->curr].next_free;
+			sdev->bios[sdev->curr].next_free = -1;
+			sdev->bios[sdev->curr].count = 0;
+			spin_unlock_irqrestore(&sdev->list_lock, flags);
+		} else {
+			spin_unlock_irqrestore(&sdev->list_lock, flags);
+			wait_event(sdev->list_wait, sdev->first_free != -1);
+		}
+	}
+	sbio = sdev->bios + sdev->curr;
+	if (sbio->count == 0) {
+		sbio->physical = physical;
+		sbio->logical = logical;
+	} else if (sbio->physical + sbio->count * PAGE_SIZE != physical) {
+		scrub_submit(sdev);
+		goto again;
+	}
+	sbio->spag[sbio->count].flags = flags;
+	sbio->spag[sbio->count].generation = gen;
+	sbio->spag[sbio->count].have_csum = 0;
+	sbio->spag[sbio->count].mirror_num = mirror_num;
+	if (csum) {
+		sbio->spag[sbio->count].have_csum = 1;
+		memcpy(sbio->spag[sbio->count].csum, csum, sdev->csum_size);
+	}
+	++sbio->count;
+	if (sbio->count == SCRUB_PAGES_PER_BIO || force)
+		scrub_submit(sdev);
+		
+	return 0;
+}
+
+static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
+                           u8 *csum)
+{
+	struct btrfs_ordered_sum *sum = NULL;
+	int ret = 0;
+	unsigned long i;
+	unsigned long num_sectors;
+	u32 sectorsize = sdev->dev->dev_root->sectorsize;
+
+	while (!list_empty(&sdev->csum_list)) {
+		sum = list_first_entry(&sdev->csum_list,
+				       struct btrfs_ordered_sum, list);
+		if (sum->bytenr > logical)
+			return 0;
+		if (sum->bytenr + sum->len > logical)
+			break;
+
+		++sdev->stat.csum_discards;
+		list_del(&sum->list);
+		kfree(sum);
+		sum = NULL;
+	}
+	if (!sum)
+		return 0;
+
+	num_sectors = sum->len / sectorsize;
+	for (i = 0; i < num_sectors; ++i) {
+		if (sum->sums[i].bytenr == logical) {
+			memcpy(csum, &sum->sums[i].sum, sdev->csum_size);
+			ret = 1;
+			break;
+		}
+	}
+	if (ret && i == num_sectors - 1) {
+		list_del(&sum->list);
+		kfree(sum);
+	}
+	return ret;
+}
+
+/* scrub extent tries to collect up to 64 kB for each bio */
+static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
+                        u64 physical, u64 flags, u64 gen, u64 mirror_num)
+{
+	int ret;
+	u8 csum[BTRFS_CSUM_SIZE];
+
+	while(len) {
+		u64 l = min_t(u64, len, PAGE_SIZE);
+		int have_csum = 0;
+
+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
+			/* push csums to sbio */
+			have_csum = scrub_find_csum(sdev, logical, l, csum);
+			if (have_csum == 0)
+				++sdev->stat.no_csum;
+		}
+		ret = scrub_page(sdev, logical, l, physical, flags, gen,
+		                 mirror_num, have_csum ? csum : NULL, 0);
+		if (ret)
+			return ret;
+		len -= l;
+		logical += l;
+		physical += l;
+	}
+	return 0;
+}
+
+static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
+	struct map_lookup *map, int num, u64 base, u64 length)
+{
+	struct btrfs_path *path;
+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_root *csum_root = fs_info->csum_root;
+	struct btrfs_extent_item *extent;
+	u64 flags;
+	int ret;
+	int slot;
+	int i;
+	int nstripes;
+	int start_stripe;
+	struct extent_buffer *l;
+	struct btrfs_key key;
+	u64 physical;
+	u64 logical;
+	u64 generation;
+	u64 mirror_num;
+
+	u64 increment = map->stripe_len;
+	u64 offset;
+
+	nstripes = length;
+	offset = 0;
+	do_div(nstripes, map->stripe_len);
+	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
+		offset = map->stripe_len * num;
+		increment = map->stripe_len * map->num_stripes;
+		mirror_num = 0;
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+		int factor = map->num_stripes / map->sub_stripes;
+		offset = map->stripe_len * (num / map->sub_stripes);
+		increment = map->stripe_len * factor;
+		mirror_num = num % map->sub_stripes;
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+		increment = map->stripe_len;
+		mirror_num = num % map->num_stripes;
+	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
+		increment = map->stripe_len;
+		mirror_num = num % map->num_stripes;
+	} else {
+		increment = map->stripe_len;
+		mirror_num = 0;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	path->reada = 2;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	/*
+	 * find all extents for each stripe and just read them to get
+	 * them into the page cache
+	 * FIXME: we can do better: build more intelligent prefetching
+	 */
+	logical = base + offset;
+	physical = map->stripes[num].physical;
+	ret = 0;
+	for (i = 0; i < nstripes; ++i) {
+		key.objectid = logical;
+		key.type = BTRFS_EXTENT_ITEM_KEY;
+		key.offset = (u64)0;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			goto out;
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(l, &key, slot);
+		if (key.objectid != logical) {
+			ret = btrfs_previous_item(root, path, 0,
+			                          BTRFS_EXTENT_ITEM_KEY);
+			if (ret < 0)
+				goto out;
+		}
+
+		while (1) {
+			l = path->nodes[0];
+			slot = path->slots[0];
+			if (slot >= btrfs_header_nritems(l)) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret == 0)
+					continue;
+				if (ret < 0)
+					goto out;
+
+				break;
+			}
+			btrfs_item_key_to_cpu(l, &key, slot);
+
+			if (key.objectid + key.offset <= logical)
+				goto next1;
+
+			if (key.objectid >= logical + map->stripe_len)
+				break;
+next1:
+			path->slots[0]++;
+		}
+		btrfs_release_path(root, path);
+		logical += increment;
+		physical += map->stripe_len;
+		cond_resched();
+	}
+
+	/*
+	 * collect all data csums for the stripe to avoid seeking during
+	 * the scrub. With crc32 csums, this currently adds up to about 1MB
+	 */
+	start_stripe = 0;
+again:
+	logical = base + offset + start_stripe * map->stripe_len;
+	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
+	for (i = start_stripe; i < nstripes; ++i) {
+		ret = btrfs_lookup_csums_range(csum_root, logical,
+		                               logical + map->stripe_len - 1,
+		                               &sdev->csum_list, 1);
+		if (ret)
+			goto out;
+
+		logical += increment;
+		cond_resched();
+	}
+	/*
+	 * now find all extents for each stripe and scrub them
+	 */
+	logical = base + offset + start_stripe * map->stripe_len;
+	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
+	ret = 0;
+	for (i = start_stripe; i < nstripes; ++i) {
+		/*
+		 * canceled?
+		 */
+		if (atomic_read(&fs_info->scrub_cancel_req) ||
+		    atomic_read(&sdev->cancel_req)) {
+			ret = -ECANCELED;
+			goto out;
+		}
+		/*
+		 * check to see if we have to pause
+		 */
+		if (atomic_read(&fs_info->scrub_pause_req)) {
+			/* push queued extents */
+			scrub_submit(sdev);
+			wait_event(sdev->list_wait,
+			           atomic_read(&sdev->in_flight) == 0);
+			atomic_inc(&fs_info->scrubs_paused);
+			wake_up(&fs_info->scrub_pause_wait);
+			mutex_lock(&fs_info->scrub_lock);
+			while(atomic_read(&fs_info->scrub_pause_req)) {
+				mutex_unlock(&fs_info->scrub_lock);
+				wait_event(fs_info->scrub_pause_wait,
+				   atomic_read(&fs_info->scrub_pause_req) == 0);
+				mutex_lock(&fs_info->scrub_lock);
+			}
+			atomic_dec(&fs_info->scrubs_paused);
+			mutex_unlock(&fs_info->scrub_lock);
+			wake_up(&fs_info->scrub_pause_wait);
+			scrub_free_csums(sdev);
+			goto again;
+		}
+
+		key.objectid = logical;
+		key.type = BTRFS_EXTENT_ITEM_KEY;
+		key.offset = (u64)0;
+
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			goto out;
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(l, &key, slot);
+		if (key.objectid != logical) {
+			ret = btrfs_previous_item(root, path, 0,
+			                          BTRFS_EXTENT_ITEM_KEY);
+			if (ret < 0)
+				goto out;
+		}
+
+		while (1) {
+			l = path->nodes[0];
+			slot = path->slots[0];
+			if (slot >= btrfs_header_nritems(l)) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret == 0)
+					continue;
+				if (ret < 0)
+					goto out;
+
+				break;
+			}
+			btrfs_item_key_to_cpu(l, &key, slot);
+
+			if (key.objectid + key.offset <= logical)
+				goto next;
+
+			if (key.objectid >= logical + map->stripe_len)
+				break;
+
+			if (btrfs_key_type(&key) != BTRFS_EXTENT_ITEM_KEY)
+				goto next;
+
+			extent = btrfs_item_ptr(l, slot,
+			                        struct btrfs_extent_item);
+			flags = btrfs_extent_flags(l, extent);
+			generation = btrfs_extent_generation(l, extent);
+
+			if (key.objectid < logical &&
+			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
+				printk(KERN_ERR
+				       "btrfs scrub: tree block %lld spanning "
+				       "stripes, ignored. logical=%lld\n",
+				       key.objectid, logical);
+				goto next;
+			}
+
+			/*
+			 * trim extent to this stripe
+			 */
+			if (key.objectid < logical) {
+				key.offset -= logical - key.objectid;
+				key.objectid = logical;
+			}
+			if (key.objectid + key.offset >
+			    logical + map->stripe_len) {
+				key.offset = logical + map->stripe_len -
+				             key.objectid;
+			}
+
+			ret = scrub_extent(sdev, key.objectid, key.offset,
+			                   key.objectid - logical + physical,
+			                   flags, generation, mirror_num);
+			if (ret)
+				goto out;
+next:
+			path->slots[0]++;
+		}
+		btrfs_release_path(root, path);
+		logical += increment;
+		physical += map->stripe_len;
+		spin_lock(&sdev->stat_lock);
+		sdev->stat.last_physical = physical;
+		spin_unlock(&sdev->stat_lock);
+	}
+	/* push queued extents */
+	scrub_submit(sdev);
+
+out:
+	btrfs_free_path(path);
+	return ret < 0 ? ret : 0;
+}
+
+static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev, 
+	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length)
+{
+	struct btrfs_mapping_tree *map_tree =
+		&sdev->dev->dev_root->fs_info->mapping_tree;
+	struct map_lookup *map;
+	struct extent_map *em;
+	int i;
+	int ret;
+
+	read_lock(&map_tree->map_tree.lock);
+	em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
+	read_unlock(&map_tree->map_tree.lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = (struct map_lookup *)em->bdev;
+	if (em->start != chunk_offset)
+		return -EINVAL;
+
+	if (em->len < length)
+		return -EINVAL;
+
+	for (i = 0; i < map->num_stripes; ++i) {
+		if (map->stripes[i].dev == sdev->dev) {
+			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
+			if (ret)
+				return ret;
+		}
+	}
+	return 0;
+}
+
+static noinline_for_stack
+int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
+{
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_path *path;
+	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	u64 length;
+	u64 chunk_tree;
+	u64 chunk_objectid;
+	u64 chunk_offset;
+	int ret;
+	int slot;
+	struct extent_buffer *l;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_block_group_cache *cache;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	path->reada = 2;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = sdev->dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			goto out;
+		ret = 0;
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != sdev->dev->devid)
+			break;
+
+		if (btrfs_key_type(&key) != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset >= end)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		if (found_key.offset + length <= start) {
+			key.offset = found_key.offset + length;
+			btrfs_release_path(root, path);
+			continue;
+		}
+
+		chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
+		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		/*
+		 * get a reference on the corresponding block group to prevent
+		 * the chunk from going away while we scrub it
+		 */
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache) {
+			ret = -ENOENT;
+			goto out;
+		}
+		ret = scrub_chunk(sdev, chunk_tree, chunk_objectid,
+		                  chunk_offset, length);
+		btrfs_put_block_group(cache);
+		if (ret)
+			break;
+
+		key.offset = found_key.offset + length;
+		btrfs_release_path(root, path);
+	}
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
+{
+	int	i;
+	u64	bytenr;
+	u64	gen;
+	int	ret;
+	struct btrfs_device *device = sdev->dev;
+	struct btrfs_root *root = device->dev_root;
+
+	gen = root->fs_info->last_trans_committed;
+
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		bytenr = btrfs_sb_offset(i);
+		if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->total_bytes)
+			break;
+
+		ret = scrub_page(sdev, bytenr, PAGE_SIZE, bytenr, 
+		                 BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
+		if (ret)
+			return ret;
+	}
+	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
+
+	return 0;
+}
+
+/*
+ * get a reference count on fs_info->scrub_workers. start worker if necessary
+ */
+static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	mutex_lock(&fs_info->scrub_lock);
+	if (fs_info->scrub_workers_refcnt == 0) {
+#ifdef SCRUB_BTRFS_WORKER
+		btrfs_start_workers(&fs_info->scrub_workers, 1);
+#else
+		fs_info->scrub_workers = create_workqueue("scrub");
+		if (!fs_info->scrub_workers) {
+			mutex_unlock(&fs_info->scrub_lock);
+			return -ENOMEM;
+		}
+#endif
+	}
+	++fs_info->scrub_workers_refcnt;
+	mutex_unlock(&fs_info->scrub_lock);
+
+	return 0;
+}
+
+static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	
+	mutex_lock(&fs_info->scrub_lock);
+	if (--fs_info->scrub_workers_refcnt == 0) {
+#ifdef SCRUB_BTRFS_WORKER
+		btrfs_stop_workers(&fs_info->scrub_workers);
+#else
+		destroy_workqueue(fs_info->scrub_workers);
+		fs_info->scrub_workers = NULL;
+#endif
+
+	}
+	WARN_ON(fs_info->scrub_workers_refcnt < 0);
+	mutex_unlock(&fs_info->scrub_lock);
+}
+
+
+int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
+                    struct btrfs_scrub_progress *progress)
+{
+	struct scrub_dev *sdev;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	int ret;
+	struct btrfs_device *dev;
+
+	if (root->fs_info->closing)
+		return -EINVAL;
+
+	/*
+	 * check some assumptions
+	 */
+	if (root->sectorsize != PAGE_SIZE ||
+	    root->sectorsize != root->leafsize ||
+	    root->sectorsize != root->nodesize) {
+		printk(KERN_ERR "btrfs_scrub: size assumptions fail\n");
+		return -EINVAL;
+	}
+	    
+	ret = scrub_workers_get(root);
+	if (ret)
+		return ret;
+
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	dev = btrfs_find_device(root, devid, NULL, NULL);
+	if (!dev || dev->missing) {
+		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+		scrub_workers_put(root);
+		return -ENODEV;
+	}
+
+	mutex_lock(&fs_info->scrub_lock);
+	if (dev->scrub_device) {
+		mutex_unlock(&fs_info->scrub_lock);
+		scrub_workers_put(root);
+		return -EINPROGRESS;
+	}
+	sdev = scrub_setup_dev(dev);
+	if (IS_ERR(sdev)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		scrub_workers_put(root);
+		return PTR_ERR(sdev);
+	}
+	dev->scrub_device = sdev;
+
+	atomic_inc(&fs_info->scrubs_running);
+	mutex_unlock(&fs_info->scrub_lock);
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
+	down_read(&fs_info->scrub_super_lock);
+	ret = scrub_supers(sdev);
+	up_read(&fs_info->scrub_super_lock);
+
+	if (!ret)
+		ret = scrub_enumerate_chunks(sdev, start, end);
+
+	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
+
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	mutex_unlock(&fs_info->scrub_lock);
+	wake_up(&fs_info->scrub_pause_wait);
+
+	if (progress)
+		memcpy(progress, &sdev->stat, sizeof(*progress));
+
+	mutex_lock(&fs_info->scrub_lock);
+	dev->scrub_device = NULL;
+	mutex_unlock(&fs_info->scrub_lock);
+
+	scrub_free_dev(sdev);
+	scrub_workers_put(root);
+
+	return ret;
+}
+
+int btrfs_scrub_pause(struct btrfs_root *root)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_inc(&fs_info->scrub_pause_req);
+	while (atomic_read(&fs_info->scrubs_paused) !=
+	       atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		wait_event(fs_info->scrub_pause_wait,
+			   atomic_read(&fs_info->scrubs_paused) ==
+			   atomic_read(&fs_info->scrubs_running));
+		mutex_lock(&fs_info->scrub_lock);
+	}
+	mutex_unlock(&fs_info->scrub_lock);
+
+	return 0;
+}
+
+int btrfs_scrub_continue(struct btrfs_root *root)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	atomic_dec(&fs_info->scrub_pause_req);
+	wake_up(&fs_info->scrub_pause_wait);
+	return 0;
+}
+
+int btrfs_scrub_pause_super(struct btrfs_root *root)
+{
+	down_write(&root->fs_info->scrub_super_lock);
+	return 0;
+}
+
+int btrfs_scrub_continue_super(struct btrfs_root *root)
+{
+	up_write(&root->fs_info->scrub_super_lock);
+	return 0;
+}
+
+int btrfs_scrub_cancel(struct btrfs_root *root)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	mutex_lock(&fs_info->scrub_lock);
+	if (!atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -ENOTCONN;
+	}
+
+	atomic_inc(&fs_info->scrub_cancel_req);
+	while(atomic_read(&fs_info->scrubs_running)) {
+		mutex_unlock(&fs_info->scrub_lock);
+		wait_event(fs_info->scrub_pause_wait,
+			   atomic_read(&fs_info->scrubs_running) == 0);
+		mutex_lock(&fs_info->scrub_lock);
+	}
+	atomic_dec(&fs_info->scrub_cancel_req);
+	mutex_unlock(&fs_info->scrub_lock);
+	
+	return 0;
+}
+
+int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct scrub_dev *sdev;
+
+	mutex_lock(&fs_info->scrub_lock);
+	sdev = dev->scrub_device;
+	if (!sdev) {
+		mutex_unlock(&fs_info->scrub_lock);
+		return -ENOTCONN;
+	}
+	atomic_inc(&sdev->cancel_req);
+	while(dev->scrub_device) {
+		mutex_unlock(&fs_info->scrub_lock);
+		wait_event(fs_info->scrub_pause_wait,
+		           dev->scrub_device == NULL);
+		mutex_lock(&fs_info->scrub_lock);
+	}
+	mutex_unlock(&fs_info->scrub_lock);
+		
+	return 0;
+}
+int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_device *dev;
+	int ret;
+
+	/*
+	 * we have to hold the device_list_mutex here so the device
+	 * does not go away in cancel_dev. FIXME: find a better solution
+	 */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	dev = btrfs_find_device(root, devid, NULL, NULL);
+	if (!dev) {
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		return -ENODEV;
+	}
+	ret = btrfs_scrub_cancel_dev(root, dev);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+	return ret;
+}
+	
+int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
+                         struct btrfs_scrub_progress *progress)
+{
+	struct btrfs_device *dev;
+	struct scrub_dev *sdev = NULL;
+
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	dev = btrfs_find_device(root, devid, NULL, NULL);
+	if (dev)
+		sdev = dev->scrub_device;
+	if (sdev)
+		memcpy(progress, &sdev->stat, sizeof(*progress));
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
+	return dev ? (sdev ? 0 : -ENOTCONN) : -ENODEV;
+}
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/6] btrfs: sync scrub with commit & device removal
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
                   ` (2 preceding siblings ...)
  2011-03-11 14:49 ` [PATCH v2 3/6] btrfs: add scrub code and prototypes Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 14:49 ` [PATCH v2 5/6] btrfs: add state information for scrub Arne Jansen
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

This adds several synchronizations:
 - for a transaction commit, the scrub gets paused before the
   tree roots are committed and stays paused until the super blocks
   are safely on disk
 - during a log commit, scrubbing of supers is disabled
 - on unmount, the scrub gets cancelled
 - on device removal, the scrub for the particular device gets cancelled

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/disk-io.c     |    1 +
 fs/btrfs/transaction.c |    3 +++
 fs/btrfs/tree-log.c    |    2 ++
 fs/btrfs/volumes.c     |    2 ++
 4 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..924a366 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2493,6 +2493,7 @@ int close_ctree(struct btrfs_root *root)
 	fs_info->closing = 1;
 	smp_mb();
 
+	btrfs_scrub_cancel(root);
 	btrfs_put_block_group_cache(fs_info);
 
 	/*
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3d73c8d..5a43b20 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1310,6 +1310,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	WARN_ON(cur_trans != trans->transaction);
 
+	btrfs_scrub_pause(root);
 	/* btrfs_commit_tree_roots is responsible for getting the
 	 * various roots consistent with each other.  Every pointer
 	 * in the tree of tree roots has to point to the most up to date
@@ -1391,6 +1392,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	mutex_unlock(&root->fs_info->trans_mutex);
 
+	btrfs_scrub_continue(root);
+
 	if (current->journal_info == trans)
 		current->journal_info = NULL;
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1f6788f..2be84fa 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2098,7 +2098,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * the running transaction open, so a full commit can't hop
 	 * in and cause problems either.
 	 */
+	btrfs_scrub_pause_super(root);
 	write_ctree_super(trans, root->fs_info->tree_root, 1);
+	btrfs_scrub_continue_super(root);
 	ret = 0;
 
 	mutex_lock(&root->log_mutex);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7dc9fa5..ad3ea88 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		goto error_undo;
 
 	device->in_fs_metadata = 0;
+	smp_mb();
+	btrfs_scrub_cancel_dev(root, device);
 
 	/*
 	 * the device list mutex makes sure that we don't change
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/6] btrfs: add state information for scrub
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
                   ` (3 preceding siblings ...)
  2011-03-11 14:49 ` [PATCH v2 4/6] btrfs: sync scrub with commit & device removal Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 16:53   ` David Sterba
  2011-03-11 14:49 ` [PATCH v2 6/6] btrfs: new ioctls " Arne Jansen
  2011-03-11 16:17 ` [PATCH v2 0/6] btrfs: scrub Ric Wheeler
  6 siblings, 1 reply; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

Add structures and state information needed for scrub

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/ctree.h   |   26 ++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |   15 +++++++++++++++
 fs/btrfs/ioctl.h   |   17 +++++++++++++++++
 fs/btrfs/volumes.h |    3 +++
 4 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 030c321..3584179 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -23,6 +23,7 @@
 #include <linux/mm.h>
 #include <linux/highmem.h>
 #include <linux/fs.h>
+#include <linux/rwsem.h>
 #include <linux/completion.h>
 #include <linux/backing-dev.h>
 #include <linux/wait.h>
@@ -32,6 +33,7 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
+#include "ioctl.h"
 
 struct btrfs_trans_handle;
 struct btrfs_transaction;
@@ -48,6 +50,8 @@ struct btrfs_ordered_sum;
 
 #define BTRFS_COMPAT_EXTENT_TREE_V0
 
+#define SCRUB_BTRFS_WORKER
+
 /*
  * files bigger than this get some pre-flushing when they are added
  * to the ordered operations list.  That way we limit the total
@@ -508,6 +512,12 @@ struct btrfs_extent_item_v0 {
 /* use full backrefs for extent pointers in the block */
 #define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)
 
+/*
+ * this flag is only used internally by scrub and may be changed at any time
+ * it is only declared here to avoid collisions
+ */
+#define BTRFS_EXTENT_FLAG_SUPER		(1ULL << 48)
+
 struct btrfs_tree_block_info {
 	struct btrfs_disk_key key;
 	u8 level;
@@ -1067,6 +1077,22 @@ struct btrfs_fs_info {
 
 	void *bdev_holder;
 
+	/* private scrub information */
+	struct mutex scrub_lock;
+	struct scrub_info *scrub_info;
+	atomic_t scrubs_running;
+	atomic_t scrub_pause_req;
+	atomic_t scrubs_paused;
+	atomic_t scrub_cancel_req;
+	wait_queue_head_t scrub_pause_wait;
+	struct rw_semaphore scrub_super_lock;
+	int scrub_workers_refcnt;
+#ifdef SCRUB_BTRFS_WORKER
+	struct btrfs_workers scrub_workers;
+#else
+	struct workqueue_struct *scrub_workers;
+#endif
+
 	/* filesystem state */
 	u64 fs_state;
 };
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 924a366..4d62bc3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1677,6 +1677,21 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->ordered_extents);
 	spin_lock_init(&fs_info->ordered_extent_lock);
 
+	mutex_init(&fs_info->scrub_lock);
+	atomic_set(&fs_info->scrubs_running, 0);
+	atomic_set(&fs_info->scrub_pause_req, 0);
+	atomic_set(&fs_info->scrubs_paused, 0);
+	atomic_set(&fs_info->scrub_cancel_req, 0);
+	init_waitqueue_head(&fs_info->scrub_pause_wait);
+	init_rwsem(&fs_info->scrub_super_lock);
+	fs_info->scrub_workers_refcnt = 0;
+#ifdef SCRUB_BTRFS_WORKER
+	btrfs_init_workers(&fs_info->scrub_workers, "scrub",
+			   fs_info->thread_pool_size, &fs_info->generic_worker);
+#else
+	fs_info->scrub_workers = NULL;
+#endif
+
 	sb->s_blocksize = 4096;
 	sb->s_blocksize_bits = blksize_bits(4096);
 	sb->s_bdi = &fs_info->bdi;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 8fb3821..973e7c8 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -42,6 +42,23 @@ struct btrfs_ioctl_vol_args_v2 {
 	char name[BTRFS_SUBVOL_NAME_MAX + 1];
 };
 
+struct btrfs_scrub_progress {
+	__u64 data_extents_scrubbed;
+	__u64 tree_extents_scrubbed;
+	__u64 data_bytes_scrubbed;
+	__u64 tree_bytes_scrubbed;
+	__u64 read_errors;
+	__u64 csum_errors;
+	__u64 verify_errors;
+	__u64 no_csum;
+	__u64 csum_discards;
+	__u64 super_errors;
+	__u64 malloc_errors;
+	__u64 uncorrectable_errors;
+	__u64 corrected_errors;
+	__u64 last_physical;
+};
+
 #define BTRFS_INO_LOOKUP_PATH_MAX 4080
 struct btrfs_ioctl_ino_lookup_args {
 	__u64 treeid;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0ccc982..92204d9 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -86,6 +86,9 @@ struct btrfs_device {
 	/* physical drive uuid (or lvm uuid) */
 	u8 uuid[BTRFS_UUID_SIZE];
 
+	/* per-device scrub information */
+	struct scrub_dev *scrub_device;
+
 	struct btrfs_work work;
 };
 
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 6/6] btrfs: new ioctls for scrub
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
                   ` (4 preceding siblings ...)
  2011-03-11 14:49 ` [PATCH v2 5/6] btrfs: add state information for scrub Arne Jansen
@ 2011-03-11 14:49 ` Arne Jansen
  2011-03-11 16:17 ` [PATCH v2 0/6] btrfs: scrub Ric Wheeler
  6 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-11 14:49 UTC (permalink / raw)
  To: chris.mason, linux-btrfs, jansen

From: Jan Schmidt <list.btrfs@jan-o-sch.net>

adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per device and that the ioctl only
returns after the scrub for this device is finished or has been
canceled.
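
A rough sketch of how user space might drive the new interface (not part of
this patch; error handling is trimmed, and it assumes that start = 0 and
end = (u64)-1 cover the whole device):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include "ioctl.h"	/* btrfs_ioctl_scrub_args, BTRFS_IOC_SCRUB */

/* fd is an open descriptor anywhere on the mounted fs; needs CAP_SYS_ADMIN */
static int scrub_one_device(int fd, __u64 devid)
{
	struct btrfs_ioctl_scrub_args sa;

	memset(&sa, 0, sizeof(sa));
	sa.devid = devid;
	sa.start = 0;
	sa.end = (__u64)-1;

	/* blocks until the scrub of this device finishes or is canceled */
	if (ioctl(fd, BTRFS_IOC_SCRUB, &sa) < 0)
		return -errno;

	printf("read errors: %llu, csum errors: %llu, corrected: %llu\n",
	       (unsigned long long)sa.progress.read_errors,
	       (unsigned long long)sa.progress.csum_errors,
	       (unsigned long long)sa.progress.corrected_errors);
	return 0;
}

While this blocks, a second thread can issue BTRFS_IOC_SCRUB_PROGRESS with the
same devid to show intermediate progress, or BTRFS_IOC_SCRUB_CANCEL to abort.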

Signed-off-by: Arne Jansen <sensille@gmx.net>
---
 fs/btrfs/ctree.h |    2 -
 fs/btrfs/ioctl.c |  131 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   38 ++++++++++++++++
 3 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3584179..896fe86 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -189,7 +189,6 @@ struct btrfs_mapping_tree {
 	struct extent_map_tree map_tree;
 };
 
-#define BTRFS_UUID_SIZE 16
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
@@ -296,7 +295,6 @@ static inline unsigned long btrfs_chunk_item_size(int num_stripes)
 		sizeof(struct btrfs_stripe) * (num_stripes - 1);
 }
 
-#define BTRFS_FSID_SIZE 16
 #define BTRFS_HEADER_FLAG_WRITTEN	(1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5fdb2ab..534e87e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1718,6 +1718,75 @@ static long btrfs_ioctl_rm_dev(struct btrfs_root *root, void __user *arg)
 	return ret;
 }
 
+static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
+{
+	struct btrfs_ioctl_fs_info_args fi_args;
+	struct btrfs_device *device;
+	struct btrfs_device *next;
+	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	fi_args.num_devices = fs_devices->num_devices;
+	fi_args.max_id = 0;
+	memcpy(&fi_args.fsid, root->fs_info->fsid, sizeof(fi_args.fsid));
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
+		if (device->devid > fi_args.max_id)
+			fi_args.max_id = device->devid;
+	}
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	if (copy_to_user(arg, &fi_args, sizeof(fi_args)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
+{
+	struct btrfs_ioctl_dev_info_args *di_args;
+	struct btrfs_device *dev;
+	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+	int ret = 0;
+	char *s_uuid = NULL;
+	char empty_uuid[BTRFS_UUID_SIZE] = {0};
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	di_args = memdup_user(arg, sizeof(*di_args));
+	if (IS_ERR(di_args))
+		return PTR_ERR(di_args);
+
+	if (memcmp(empty_uuid, di_args->uuid, BTRFS_UUID_SIZE) != 0)
+		s_uuid = di_args->uuid;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	dev = btrfs_find_device(root, di_args->devid, s_uuid, NULL);
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	if (!dev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	di_args->devid = dev->devid;
+	di_args->bytes_used = dev->bytes_used;
+	di_args->total_bytes = dev->total_bytes;
+	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
+	strncpy(di_args->path, dev->name, sizeof(di_args->path));
+
+out:
+	if (ret == 0 && copy_to_user(arg, di_args, sizeof(*di_args)))
+		ret = -EFAULT;
+
+	kfree(di_args);
+	return ret;
+}
+
 static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 				       u64 off, u64 olen, u64 destoff)
 {
@@ -2375,6 +2444,58 @@ static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user *argp)
 	return btrfs_wait_for_commit(root, transid);
 }
 
+static long btrfs_ioctl_scrub(struct btrfs_root *root, void __user *arg)
+{
+	int ret;
+	struct btrfs_ioctl_scrub_args *sa;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	sa = memdup_user(arg, sizeof(*sa));
+	if (IS_ERR(sa))
+		return PTR_ERR(sa);
+
+	ret = btrfs_scrub_dev(root, sa->devid, sa->start, sa->end,
+	                      &sa->progress);
+
+	if (copy_to_user(arg, sa, sizeof(*sa)))
+		ret = -EFAULT;
+
+	kfree(sa);
+	return ret;
+}
+
+static long btrfs_ioctl_scrub_cancel(struct btrfs_root *root, void __user *arg)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	return btrfs_scrub_cancel(root);
+}
+
+static long btrfs_ioctl_scrub_progress(struct btrfs_root *root,
+                                       void __user *arg)
+{
+	struct btrfs_ioctl_scrub_args *sa;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	sa = memdup_user(arg, sizeof(*sa));
+	if (IS_ERR(sa))
+		return PTR_ERR(sa);
+
+	ret = btrfs_scrub_progress(root, sa->devid, &sa->progress);
+
+	if (copy_to_user(arg, sa, sizeof(*sa)))
+		ret = -EFAULT;
+
+	kfree(sa);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2412,6 +2533,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_add_dev(root, argp);
 	case BTRFS_IOC_RM_DEV:
 		return btrfs_ioctl_rm_dev(root, argp);
+	case BTRFS_IOC_FS_INFO:
+		return btrfs_ioctl_fs_info(root, argp);
+	case BTRFS_IOC_DEV_INFO:
+		return btrfs_ioctl_dev_info(root, argp);
 	case BTRFS_IOC_BALANCE:
 		return btrfs_balance(root->fs_info->dev_root);
 	case BTRFS_IOC_CLONE:
@@ -2435,6 +2560,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_start_sync(file, argp);
 	case BTRFS_IOC_WAIT_SYNC:
 		return btrfs_ioctl_wait_sync(file, argp);
+	case BTRFS_IOC_SCRUB:
+		return btrfs_ioctl_scrub(root, argp);
+	case BTRFS_IOC_SCRUB_CANCEL:
+		return btrfs_ioctl_scrub_cancel(root, argp);
+	case BTRFS_IOC_SCRUB_PROGRESS:
+		return btrfs_ioctl_scrub_progress(root, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 973e7c8..52bcd02 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -32,6 +32,8 @@ struct btrfs_ioctl_vol_args {
 
 #define BTRFS_SUBVOL_CREATE_ASYNC	(1ULL << 0)
 #define BTRFS_SUBVOL_RDONLY		(1ULL << 1)
+#define BTRFS_FSID_SIZE 16
+#define BTRFS_UUID_SIZE 16
 
 #define BTRFS_SUBVOL_NAME_MAX 4039
 struct btrfs_ioctl_vol_args_v2 {
@@ -59,6 +61,33 @@ struct btrfs_scrub_progress {
 	__u64 last_physical;
 };
 
+struct btrfs_ioctl_scrub_args {
+	__u64 devid;				/* in */
+	__u64 start;				/* in */
+	__u64 end;				/* in */
+	__u64 flags;				/* in */
+	struct btrfs_scrub_progress progress;	/* out */
+	/* pad to 1k */
+	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
+};
+
+#define BTRFS_DEVICE_PATH_NAME_MAX 1024
+struct btrfs_ioctl_dev_info_args {
+	__u64 devid;				/* in/out */
+	__u8 uuid[BTRFS_UUID_SIZE];		/* in/out */
+	__u64 bytes_used;			/* out */
+	__u64 total_bytes;			/* out */
+	__u64 unused[379];			/* pad to 4k */
+	__u8 path[BTRFS_DEVICE_PATH_NAME_MAX];	/* out */
+};
+
+struct btrfs_ioctl_fs_info_args {
+	__u64 max_id;				/* out */
+	__u64 num_devices;			/* out */
+	__u8 fsid[BTRFS_FSID_SIZE];		/* out */
+	__u64 reserved[124];			/* pad to 1k */
+};
+
 #define BTRFS_INO_LOOKUP_PATH_MAX 4080
 struct btrfs_ioctl_ino_lookup_args {
 	__u64 treeid;
@@ -220,4 +249,13 @@ struct btrfs_ioctl_space_args {
 				   struct btrfs_ioctl_vol_args_v2)
 #define BTRFS_IOC_SUBVOL_GETFLAGS _IOW(BTRFS_IOCTL_MAGIC, 25, __u64)
 #define BTRFS_IOC_SUBVOL_SETFLAGS _IOW(BTRFS_IOCTL_MAGIC, 26, __u64)
+#define BTRFS_IOC_SCRUB _IOWR(BTRFS_IOCTL_MAGIC, 27, \
+                             struct btrfs_ioctl_scrub_args)
+#define BTRFS_IOC_SCRUB_CANCEL _IO(BTRFS_IOCTL_MAGIC, 28)
+#define BTRFS_IOC_SCRUB_PROGRESS _IOWR(BTRFS_IOCTL_MAGIC, 29, \
+                             struct btrfs_ioctl_scrub_args)
+#define BTRFS_IOC_DEV_INFO _IOWR(BTRFS_IOCTL_MAGIC, 30, \
+                                 struct btrfs_ioctl_dev_info_args)
+#define BTRFS_IOC_FS_INFO _IOR(BTRFS_IOCTL_MAGIC, 31, \
+                                 struct btrfs_ioctl_fs_info_args)
 #endif
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/6] btrfs: scrub
  2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
                   ` (5 preceding siblings ...)
  2011-03-11 14:49 ` [PATCH v2 6/6] btrfs: new ioctls " Arne Jansen
@ 2011-03-11 16:17 ` Ric Wheeler
  2011-03-12 13:20   ` Arne Jansen
  6 siblings, 1 reply; 15+ messages in thread
From: Ric Wheeler @ 2011-03-11 16:17 UTC (permalink / raw)
  To: Arne Jansen; +Cc: chris.mason, linux-btrfs, jansen

On 03/11/2011 09:49 AM, Arne Jansen wrote:
> This series adds an initial implementation for scrub. It works quite
> straightforward. The usermode issues an ioctl for each device in the
> fs. For each device, it enumerates the allocated device chunks. For
> each chunk, the contained extents are enumerated and the data checksums
> fetched. The extents are read sequentially and the checksums verified.
> If an error occurs (checksum or EIO), a good copy is searched for. If
> one is found, the bad copy will be rewritten.
> All enumerations happen from the commit roots. During a transaction
> commit, the scrubs get paused and afterwards continue from the new
> roots.
> For future improvements please see the inline comments.
>
> The accompanying user mode patches will follow shortly.
>
> This v2 mainly changes the dev_info ioctl interface.
>
> Thanks,
> Arne
>

Great work!

I do wonder if we should also worry about the unallocated part of the storage
device. With local disks especially, unused space can accumulate errors over
time.

What you could add to your scrub phase is a simple read operation over the
unallocated ranges (optionally a READ_VERIFY, which validates the data on the
platter without transferring it over the bus to the host).

The recovery operation here would be to write (zeros) to the block if an error
is detected, so we might be pessimistic and simply write to "zero" those
unallocated ranges as well.  Note that there are "WRITE_SAME" commands that RAID
people use, for example, to initialize unused drives.

I would not run the overwrite or read check on SSDs or arrays, so this would be
an optional type of scrub, I suppose.

Regards,

Ric

> Arne Jansen (5):
>    btrfs: add parameter to btrfs_lookup_csum_range
>    btrfs: make struct map_lookup public
>    btrfs: add scrub code and prototypes
>    btrfs: sync scrub with commit & device removal
>    btrfs: add state information for scrub
>
> Jan Schmidt (1):
>    btrfs: new ioctls for scrub
>
>   fs/btrfs/Makefile      |    2 +-
>   fs/btrfs/ctree.h       |   46 ++-
>   fs/btrfs/disk-io.c     |   16 +
>   fs/btrfs/file-item.c   |    8 +-
>   fs/btrfs/inode.c       |    2 +-
>   fs/btrfs/ioctl.c       |  131 +++++
>   fs/btrfs/ioctl.h       |   55 ++
>   fs/btrfs/relocation.c  |    2 +-
>   fs/btrfs/scrub.c       | 1463 ++++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/transaction.c |    3 +
>   fs/btrfs/tree-log.c    |    6 +-
>   fs/btrfs/volumes.c     |   16 +-
>   fs/btrfs/volumes.h     |   17 +
>   13 files changed, 1743 insertions(+), 24 deletions(-)
>   create mode 100644 fs/btrfs/scrub.c
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes
  2011-03-11 14:49 ` [PATCH v2 3/6] btrfs: add scrub code and prototypes Arne Jansen
@ 2011-03-11 16:34   ` David Sterba
  2011-03-12 10:54     ` Arne Jansen
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2011-03-11 16:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sensille

On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
> This is the main scrub code.
> 
> Signed-off-by: Arne Jansen <sensille@gmx.net>
> ---
>  fs/btrfs/Makefile |    2 +-
>  fs/btrfs/ctree.h  |   14 +
>  fs/btrfs/scrub.c  | 1463 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 1478 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 31610ea..8fda313 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
>  	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
>  	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
> -	   compression.o delayed-ref.o relocation.o
> +	   compression.o delayed-ref.o relocation.o scrub.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4c99834..030c321 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2610,4 +2610,18 @@ void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
>  			      u64 *bytes_to_reserve);
>  void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>  			      struct btrfs_pending_snapshot *pending);
> +
> +/* scrub.c */
> +int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
> +                    struct btrfs_scrub_progress *progress);
> +int btrfs_scrub_pause(struct btrfs_root *root);
> +int btrfs_scrub_pause_super(struct btrfs_root *root);
> +int btrfs_scrub_continue(struct btrfs_root *root);
> +int btrfs_scrub_continue_super(struct btrfs_root *root);
> +int btrfs_scrub_cancel(struct btrfs_root *root);
> +int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
> +int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
> +int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
> +                         struct btrfs_scrub_progress *progress);
> +
>  #endif
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> new file mode 100644
> index 0000000..d606f4d
> --- /dev/null
> +++ b/fs/btrfs/scrub.c
> @@ -0,0 +1,1463 @@
> +/*
> + * Copyright (C) 2011 STRATO.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/pagemap.h>
> +#include <linux/writeback.h>
> +#include <linux/blkdev.h>
> +#include <linux/rbtree.h>
> +#include <linux/slab.h>
> +#include <linux/workqueue.h>
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "disk-io.h"
> +#include "ordered-data.h"
> +
> +/*
> + * This is only the first step towards a full-features scrub. It reads all
> + * extent and super block and verifies the checksums. In case a bad checksum
> + * is found or the extent cannot be read, good data will be written back if
> + * any can be found.
> + *
> + * Future enhancements:
> + *  - To enhance the performance, better read-ahead strategies for the
> + *    extent-tree can be employed.
> + *  - In case an unrepairable extent is encountered, track which files are
> + *    affected and report them
> + *  - In case of a read error on files with nodatasum, map the file and read
> + *    the extent to trigger a writeback of the good copy
> + *  - track and record media errors, throw out bad devices
> + *  - add a readonly mode
> + *  - add a mode to also read unallocated space
> + */
> +
> +#ifdef SCRUB_BTRFS_WORKER
> +typedef struct btrfs_work scrub_work_t;
> +#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
> +#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker(&(wq), w); } while (0)
> +#else
> +typedef struct work_struct scrub_work_t;
> +#define SCRUB_INIT_WORK INIT_WORK
> +#define SCRUB_QUEUE_WORK queue_work
> +#endif
> +
> +struct scrub_bio;
> +struct scrub_page;
> +struct scrub_dev;
> +struct scrub_fixup;
> +static void scrub_bio_end_io(struct bio *bio, int err);
> +static void scrub_checksum(scrub_work_t *work);
> +static int scrub_checksum_data(struct scrub_dev *sdev,
> +                               struct scrub_page *spag, void *buffer);
> +static int scrub_checksum_tree_block(struct scrub_dev *sdev,
> +                                     struct scrub_page *spag, u64 logical,
> +                                     void *buffer);
> +static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
> +static void scrub_recheck_end_io(struct bio *bio, int err);
> +static void scrub_fixup_worker(scrub_work_t *work);
> +static void scrub_fixup(struct scrub_fixup *fixup);
> +
> +#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
> +#define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
> +
> +struct scrub_page {
> +	u64			flags;  /* extent flags */
> +	u64			generation;
> +	u64			mirror_num;
> +	int			have_csum;
> +	u8			csum[BTRFS_CSUM_SIZE];
> +};
> +
> +struct scrub_bio {
> +	int			index;
> +	struct scrub_dev	*sdev;
> +	struct bio		*bio;
> +	int			err;
> +	u64			logical;
> +	u64			physical;
> +	struct scrub_page	spag[SCRUB_PAGES_PER_BIO];
> +	u64			count;
> +	int			next_free;
> +	scrub_work_t		work;
> +};
> +
> +struct scrub_dev {
> +	struct scrub_bio	bios[SCRUB_BIOS_PER_DEV];

sizeof(struct scrub_bio) == 1160
SCRUB_BIOS_PER_DEV == 16

> +	struct btrfs_device	*dev;
> +	int			first_free;
> +	int			curr;
> +	atomic_t		in_flight;
> +	spinlock_t		list_lock;
> +	wait_queue_head_t	list_wait;
> +	u16			csum_size;
> +	struct list_head	csum_list;
> +	atomic_t		cancel_req;
> +	/*
> +	 * statistics
> +	 */
> +	struct btrfs_scrub_progress stat;
> +	spinlock_t		stat_lock;
> +};

sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3 allocation in
scrub_setup_dev()
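
(For reference: 16 * 1160 = 18560 bytes for the embedded bios[] array alone,
plus roughly 200 bytes for the remaining fields; 18760 bytes gets rounded up
to 32 KiB, i.e. 8 contiguous pages, hence the order-3 allocation.)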

> +
> +struct scrub_fixup {
> +	struct scrub_dev	*sdev;
> +	struct bio		*bio;
> +	u64			logical;
> +	u64			physical;
> +	struct scrub_page	spag;
> +	scrub_work_t		work;
> +	int			err;
> +	int			recheck;
> +};
> +
> +static void scrub_free_csums(struct scrub_dev *sdev)
> +{
> +	while(!list_empty(&sdev->csum_list)) {
> +		struct btrfs_ordered_sum *sum;
> +		sum = list_first_entry(&sdev->csum_list,
> +		                       struct btrfs_ordered_sum, list);
> +		list_del(&sum->list);
> +		kfree(sum);
> +	}
> +}
> +
> +static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
> +{
> +	int i;
> +	int j;
> +	struct page *last_page;
> +
> +	if (!sdev)
> +		return;
> +
> +	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
> +		struct bio *bio = sdev->bios[i].bio;
> +		if (bio)
                   ^^^^^
stop when we find something to free?


> +			break;
> +		
> +		last_page = NULL;
> +		for (j = 0; j < bio->bi_vcnt; ++j) {
                                ^^^
and dereference it.

> +			if (bio->bi_io_vec[i].bv_page == last_page)
> +				continue;
> +			last_page = bio->bi_io_vec[i].bv_page;
> +			__free_page(last_page);
> +		}
> +		bio_put(sdev->bios[i].bio);
> +	}
> +
> +	scrub_free_csums(sdev);
> +	kfree(sdev);
> +}
> +
> +static noinline_for_stack
> +struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
> +{
> +	struct scrub_dev *sdev;
> +	int		i;
> +	int		j;
> +	int		ret;
> +	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;

(coding style expects a newline here)

> +	sdev = kzalloc(sizeof(*sdev), GFP_NOFS);
> +	if (!sdev)
> +		goto nomem;
> +	sdev->dev = dev;
> +	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
> +		struct bio *bio;
> +
> +		bio = bio_alloc(GFP_NOFS, SCRUB_PAGES_PER_BIO);
> +		if (!bio)
> +			goto nomem;
> +
> +		sdev->bios[i].index = i;
> +		sdev->bios[i].sdev = sdev;
> +		sdev->bios[i].bio = bio;
> +		sdev->bios[i].count = 0;
> +		SCRUB_INIT_WORK(&sdev->bios[i].work, scrub_checksum);
> +		bio->bi_private = sdev->bios + i;
> +		bio->bi_end_io = scrub_bio_end_io;
> +		bio->bi_sector = 0;
> +		bio->bi_bdev = dev->bdev;
> +		bio->bi_size = 0;
> +
> +		for (j = 0; j < SCRUB_PAGES_PER_BIO; ++j) {
> +			struct page *page;
> +			page = alloc_page(GFP_NOFS);
> +			if (!page)
> +				goto nomem;
> +
> +			ret = bio_add_page(bio, page, PAGE_SIZE, 0);
> +			if (!ret)
> +				goto nomem;
> +		}
> +		WARN_ON(bio->bi_vcnt != SCRUB_PAGES_PER_BIO);
> +
> +		if (i != SCRUB_BIOS_PER_DEV-1)
> +			sdev->bios[i].next_free = i + 1;
> +		 else
> +			sdev->bios[i].next_free = -1;
> +	}
> +	sdev->first_free = 0;
> +	sdev->curr = -1;
> +	atomic_set(&sdev->in_flight, 0);
> +	atomic_set(&sdev->cancel_req, 0);
> +	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
> +	INIT_LIST_HEAD(&sdev->csum_list);
> +	
> +	spin_lock_init(&sdev->list_lock);
> +	spin_lock_init(&sdev->stat_lock);
> +	init_waitqueue_head(&sdev->list_wait);
> +	return sdev;
> +
> +nomem:
> +	scrub_free_dev(sdev);

When taking the 'goto nomem' path, either all bios are leaked, or the
check in scrub_free_dev is buggy ...
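
Presumably something along these lines was intended (untested; note also that
the inner loop should index bi_io_vec with j, not i):

	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
		struct bio *bio = sdev->bios[i].bio;
		if (!bio)
			break;	/* bios are allocated in order, nothing more to free */

		last_page = NULL;
		for (j = 0; j < bio->bi_vcnt; ++j) {
			if (bio->bi_io_vec[j].bv_page == last_page)
				continue;
			last_page = bio->bi_io_vec[j].bv_page;
			__free_page(last_page);
		}
		bio_put(bio);
	}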

> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +/*
> + * scrub_recheck_error gets called when either verification of the page
> + * failed or the bio failed to read, e.g. with EIO. In the latter case,
> + * recheck_error gets called for every page in the bio, even though only
> + * one may be bad
> + */
> +static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
> +{
> +	struct scrub_dev *sdev = sbio->sdev;
> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> +	struct bio *bio = NULL;
> +	struct page *page = NULL;
> +	struct scrub_fixup *fixup = NULL;
> +	int ret;
> +
> +	/*
> +	 * while we're in here we do not want the transaction to commit.
> +	 * To prevent it, we increment scrubs_running. scrub_pause will
> +	 * have to wait until we're finished
> +	 */
> +	mutex_lock(&fs_info->scrub_lock);
> +	atomic_inc(&fs_info->scrubs_running);
> +	mutex_unlock(&fs_info->scrub_lock);
> +
> +	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
> +	if (!fixup)
> +		goto malloc_error;
> +
> +	fixup->logical = sbio->logical + ix * PAGE_SIZE;
> +	fixup->physical = sbio->physical + ix * PAGE_SIZE;
> +	fixup->spag = sbio->spag[ix];
> +	fixup->sdev = sdev;
> +
> +	bio = bio_alloc(GFP_NOFS, 1);
> +	if (!bio)
> +		goto malloc_error;
> +	bio->bi_private = fixup;
> +	bio->bi_size = 0;
> +	bio->bi_bdev = sdev->dev->bdev;	/* FIXME: temporary for add_page */
> +	fixup->bio = bio;
> +	fixup->recheck = 0;
> +
> +	page = alloc_page(GFP_NOFS);
> +	if (!page)
> +		goto malloc_error;
> +
> +	ret = bio_add_page(bio, page, PAGE_SIZE, 0);
> +	if (!ret)
> +		goto malloc_error;
> +
> +	if (!sbio->err) {
> +		/*
> +		 * shorter path: just a checksum error, go ahead and correct it
> +		 */
> +		scrub_fixup_worker(&fixup->work);
> +		return;
> +	}
> +
> +	/*
> +	 * an I/O-error occured for one of the blocks in the bio, not
> +	 * necessarily for this one, so first try to read it separately
> +	 */
> +	SCRUB_INIT_WORK(&fixup->work, scrub_fixup_worker);
> +	fixup->recheck = 1;
> +	bio->bi_end_io = scrub_recheck_end_io;
> +	bio->bi_sector = fixup->physical >> 9;
> +	bio->bi_bdev = sdev->dev->bdev;
> +	submit_bio(0, bio);
> +
> +	return;
> +
> +malloc_error:
> +	if (bio) 
> +		bio_put(bio);
> +	if (page)
> +		__free_page(page);
> +	if (fixup)
> +		kfree(fixup);
> +	spin_lock(&sdev->stat_lock);
> +	++sdev->stat.malloc_errors;
> +	spin_unlock(&sdev->stat_lock);
> +	mutex_lock(&fs_info->scrub_lock);
> +	atomic_dec(&fs_info->scrubs_running);
> +	mutex_unlock(&fs_info->scrub_lock);
> +	wake_up(&fs_info->scrub_pause_wait);
> +}
> +
> +static void scrub_recheck_end_io(struct bio *bio, int err)
> +{
> +	struct scrub_fixup *fixup = bio->bi_private;
> +	struct btrfs_fs_info *fs_info = fixup->sdev->dev->dev_root->fs_info;
> +
> +	fixup->err = err;
> +	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &fixup->work);
> +}
> +
> +static int scrub_fixup_check(struct scrub_fixup *fixup)
> +{
> +	int ret = 1;
> +	struct page *page;
> +	void *buffer;
> +	u64 flags = fixup->spag.flags;
> +
> +	page = fixup->bio->bi_io_vec[0].bv_page;
> +	buffer = kmap_atomic(page, KM_USER0);
> +	if (flags & BTRFS_EXTENT_FLAG_DATA) {
> +		ret = scrub_checksum_data(fixup->sdev,
> +					  &fixup->spag, buffer);
> +	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> +		ret = scrub_checksum_tree_block(fixup->sdev,
> +						&fixup->spag,
> +						fixup->logical,
> +						buffer);
> +	} else {
> +		WARN_ON(1);
> +	}
> +	kunmap_atomic(buffer, KM_USER0);
> +
> +	return ret;
> +}
> +
> +static void scrub_fixup_worker(scrub_work_t *work)
> +{
> +	struct scrub_fixup *fixup;
> +	struct btrfs_fs_info *fs_info;
> +	u64 flags;
> +	int ret = 1;
> +
> +	fixup = container_of(work, struct scrub_fixup, work);
> +	fs_info = fixup->sdev->dev->dev_root->fs_info;
> +	flags = fixup->spag.flags;
> +
> +	if (fixup->recheck && fixup->err == 0)
> +		ret = scrub_fixup_check(fixup);
> +
> +	if (ret || fixup->err)
> +		scrub_fixup(fixup);
> +
> +	__free_page(fixup->bio->bi_io_vec[0].bv_page);
> +	bio_put(fixup->bio);
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	atomic_dec(&fs_info->scrubs_running);
> +	mutex_unlock(&fs_info->scrub_lock);
> +	wake_up(&fs_info->scrub_pause_wait);
> +
> +	kfree(fixup);
> +}
> +
> +static void scrub_fixup_end_io(struct bio *bio, int err)
> +{
> +	complete((struct completion *)bio->bi_private);
> +}
> +
> +static void scrub_fixup(struct scrub_fixup *fixup)
> +{
> +	struct scrub_dev *sdev = fixup->sdev;
> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> +	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree; 
> +	struct btrfs_multi_bio *multi = NULL;
> +	struct bio *bio = fixup->bio;
> +	u64 length;
> +	int i;
> +	int ret;
> +	DECLARE_COMPLETION_ONSTACK(complete);
> +
> +	if ((fixup->spag.flags & BTRFS_EXTENT_FLAG_DATA) &&
> +	    (fixup->spag.have_csum == 0)) {
> +		/*
> +		 * nodatasum, don't try to fix anything
> +		 * FIXME: we can do better, open the inode and trigger a
> +		 * writeback
> +		 */
> +		goto uncorrectable;
> +	}
> +
> +	length = PAGE_SIZE;
> +	ret = btrfs_map_block(map_tree, REQ_WRITE, fixup->logical, &length,
> +	                      &multi, 0);
> +	if (ret || !multi || length < PAGE_SIZE) {
> +		printk(KERN_ERR
> +		       "scrub_fixup: btrfs_map_block failed us for %lld\n",
> +		       fixup->logical);
> +		WARN_ON(1);
> +		return;
> +	}
> +
> +	if (multi->num_stripes == 1) {
> +		/* there aren't any replicas */
> +		goto uncorrectable;
> +	}
> +
> +	/*
> +	 * first find a good copy
> +	 */
> +	for (i = 0; i < multi->num_stripes; ++i) {
> +		if (i == fixup->spag.mirror_num)
> +			continue;
> +
> +		bio->bi_sector = multi->stripes[i].physical >> 9;
> +		bio->bi_bdev = multi->stripes[i].dev->bdev;
> +		bio->bi_size = PAGE_SIZE;
> +		bio->bi_next = NULL;
> +		bio->bi_flags = 1 << BIO_UPTODATE;
> +		bio->bi_comp_cpu = -1;
> +		bio->bi_end_io = scrub_fixup_end_io;
> +		bio->bi_private = &complete;
> +
> +		submit_bio(0, bio);
> +
> +		wait_for_completion(&complete);
> +
> +		if (~bio->bi_flags & BIO_UPTODATE)
> +			/* I/O-error, this is not a good copy */
> +			continue;
> +
> +		ret = scrub_fixup_check(fixup);
> +		if (ret == 0)
> +			break;
> +	}
> +	if (i == multi->num_stripes)
> +		goto uncorrectable;
> +
> +	/*
> +	 * the bio now contains good data, write it back
> +	 */
> +	bio->bi_sector = fixup->physical >> 9;
> +	bio->bi_bdev = sdev->dev->bdev;
> +	bio->bi_size = PAGE_SIZE;
> +	bio->bi_next = NULL;
> +	bio->bi_flags = 1 << BIO_UPTODATE;
> +	bio->bi_comp_cpu = -1;
> +	bio->bi_end_io = scrub_fixup_end_io;
> +	bio->bi_private = &complete;
> +
> +	submit_bio(REQ_WRITE, bio);
> +
> +	wait_for_completion(&complete);
> +
> +	if (~bio->bi_flags & BIO_UPTODATE)
> +		/* I/O-error, writeback failed, give up */
> +		goto uncorrectable;
> +
> +	kfree(multi);
> +	spin_lock(&sdev->stat_lock);
> +	++sdev->stat.corrected_errors;
> +	spin_unlock(&sdev->stat_lock);
> +
> +	if (printk_ratelimit())
> +		printk(KERN_ERR "btrfs: fixed up at %lld\n", fixup->logical);
> +	return;
> +
> +uncorrectable:
> +	kfree(multi);
> +	spin_lock(&sdev->stat_lock);
> +	++sdev->stat.uncorrectable_errors;
> +	spin_unlock(&sdev->stat_lock);
> +
> +	if (printk_ratelimit())
> +		printk(KERN_ERR "btrfs: unable to fixup at %lld\n",
> +			 fixup->logical);
> +}
> +
> +static void scrub_bio_end_io(struct bio *bio, int err)
> +{
> +	struct scrub_bio *sbio = bio->bi_private;
> +	struct scrub_dev *sdev = sbio->sdev;
> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> +
> +	sbio->err = err;
> +
> +	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &sbio->work);
> +}
> +
> +static void scrub_checksum(scrub_work_t *work)
> +{
> +	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
> +	struct scrub_dev *sdev = sbio->sdev;
> +	struct page *page;
> +	void *buffer;
> +	int i;
> +	u64 flags;
> +	u64 logical;
> +	int ret;
> +
> +	if (sbio->err) {
> +		for (i = 0; i < sbio->count; ++i) {
> +			scrub_recheck_error(sbio, i);
> +		}
> +		spin_lock(&sdev->stat_lock);
> +		++sdev->stat.read_errors;
> +		spin_unlock(&sdev->stat_lock);
> +		goto out;
> +	}
> +	for (i = 0; i < sbio->count; ++i) {
> +		page = sbio->bio->bi_io_vec[i].bv_page;
> +		buffer = kmap_atomic(page, KM_USER0);
> +		flags = sbio->spag[i].flags;
> +		logical = sbio->logical + i * PAGE_SIZE;
> +		ret = 0;
> +		if (flags & BTRFS_EXTENT_FLAG_DATA) {
> +			ret = scrub_checksum_data(sdev, sbio->spag + i, buffer);
> +		} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> +			ret = scrub_checksum_tree_block(sdev, sbio->spag + i,
> +			                                logical, buffer);
> +		} else if (flags & BTRFS_EXTENT_FLAG_SUPER) {
> +			BUG_ON(i);
> +			(void)scrub_checksum_super(sbio, buffer);
> +		} else {
> +			WARN_ON(1);
> +		}
> +		kunmap_atomic(buffer, KM_USER0);
> +		if (ret)
> +			scrub_recheck_error(sbio, i);
> +	}
> +
> +out:
> +	spin_lock(&sdev->list_lock);
> +	sbio->next_free = sdev->first_free;
> +	sdev->first_free = sbio->index;
> +	spin_unlock(&sdev->list_lock);
> +	atomic_dec(&sdev->in_flight);
> +	wake_up(&sdev->list_wait);
> +}
> +
> +static int scrub_checksum_data(struct scrub_dev *sdev,
> +                               struct scrub_page *spag, void *buffer)
> +{
> +	u8 csum[BTRFS_CSUM_SIZE];
> +	u32 crc = ~(u32)0;
> +	int fail = 0;
> +	struct btrfs_root *root = sdev->dev->dev_root;
> +
> +	if (!spag->have_csum)
> +		return 0;
> +
> +	crc = btrfs_csum_data(root, buffer, crc, PAGE_SIZE);
> +	btrfs_csum_final(crc, csum);
> +	if (memcmp(csum, spag->csum, sdev->csum_size))
> +		fail = 1;
> +
> +	spin_lock(&sdev->stat_lock);
> +	++sdev->stat.data_extents_scrubbed;
> +	sdev->stat.data_bytes_scrubbed += PAGE_SIZE;
> +	if (fail)
> +		++sdev->stat.csum_errors;
> +	spin_unlock(&sdev->stat_lock);
> +
> +	return fail;
> +}
> +
> +static int scrub_checksum_tree_block(struct scrub_dev *sdev,
> +                                     struct scrub_page *spag, u64 logical,
> +                                     void *buffer)
> +{
> +	struct btrfs_header *h;
> +	struct btrfs_root *root = sdev->dev->dev_root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +	u32 crc = ~(u32)0;
> +	int fail = 0;
> +	int crc_fail = 0;
> +
> +	/*
> +	 * we don't use the getter functions here, as we
> +	 * a) don't have an extent buffer and
> +	 * b) the page is already kmapped
> +	 */
> +	h = (struct btrfs_header *)buffer;
> +
> +	if (logical != le64_to_cpu(h->bytenr))
> +		++fail;
> +
> +	if (spag->generation != le64_to_cpu(h->generation))
> +		++fail;
> +
> +	if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
> +		++fail;
> +
> +	if (memcmp(h->chunk_tree_uuid, fs_info->chunk_tree_uuid,
> +	           BTRFS_UUID_SIZE))
> +		++fail;
> +
> +	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
> +	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
> +	btrfs_csum_final(crc, csum);
> +	if (memcmp(csum, h->csum, sdev->csum_size))
> +		++crc_fail;
> +
> +	spin_lock(&sdev->stat_lock);
> +	++sdev->stat.tree_extents_scrubbed;
> +	sdev->stat.tree_bytes_scrubbed += PAGE_SIZE;
> +	if (crc_fail)
> +		++sdev->stat.csum_errors;
> +	if (fail)
> +		++sdev->stat.verify_errors;
> +	spin_unlock(&sdev->stat_lock);
> +
> +	return (fail || crc_fail);
> +}
> +
> +static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer)
> +{
> +	struct btrfs_super_block *s;
> +	u64 logical;
> +	struct scrub_dev *sdev = sbio->sdev;
> +	struct btrfs_root *root = sdev->dev->dev_root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +	u32 crc = ~(u32)0;
> +	int fail = 0;
> +
> +	s = (struct btrfs_super_block *)buffer;
> +	logical = sbio->logical;
> +
> +	if (logical != le64_to_cpu(s->bytenr))
> +		++fail;
> +
> +	if (sbio->spag[0].generation != le64_to_cpu(s->generation))
> +		++fail;
> +
> +	if (memcmp(s->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
> +		++fail;
> +
> +	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
> +	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
> +	btrfs_csum_final(crc, csum);
> +	if (memcmp(csum, s->csum, sbio->sdev->csum_size))
> +		++fail;
> +
> +	if (fail) {
> +		/*
> +		 * if we find an error in a super block, we just report it.
> +		 * They will get written with the next transaction commit
> +		 * anyway
> +		 */
> +		spin_lock(&sdev->stat_lock);
> +		++sdev->stat.super_errors;
> +		spin_unlock(&sdev->stat_lock);
> +	}
> +
> +	return fail;
> +}
> +
> +static int scrub_submit(struct scrub_dev *sdev)
> +{
> +	struct scrub_bio *sbio;
> +
> +	if (sdev->curr == -1)
> +		return 0;
> +
> +	sbio = sdev->bios + sdev->curr;
> +	
> +	sbio->bio->bi_sector = sbio->physical >> 9;
> +	sbio->bio->bi_size = sbio->count * PAGE_SIZE;
> +	sbio->bio->bi_next = NULL;
> +	sbio->bio->bi_flags = 1 << BIO_UPTODATE;
> +	sbio->bio->bi_comp_cpu = -1;
> +	sbio->bio->bi_bdev = sdev->dev->bdev;
> +	sdev->curr = -1;
> +	atomic_inc(&sdev->in_flight);
> +
> +	submit_bio(0, sbio->bio);
> +
> +	return 0;
> +}
> +
> +static int scrub_page(struct scrub_dev *sdev, u64 logical, u64 len,
> +                      u64 physical, u64 flags, u64 gen, u64 mirror_num,
> +                      u8 *csum, int force)
> +{
> +	struct scrub_bio *sbio;
> +again:
> +	/*
> +	 * grab a fresh bio or wait for one to become available
> +	 */
> +	while (sdev->curr == -1) {
> +		unsigned long flags;
> +		spin_lock_irqsave(&sdev->list_lock, flags);

Is this called from interrupt context, or why is the _irqsave variant used?
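
(If not, plain

	spin_lock(&sdev->list_lock);
	...
	spin_unlock(&sdev->list_lock);

would do here.)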

> +		sdev->curr = sdev->first_free;
> +		if (sdev->curr != -1) {
> +			sdev->first_free = sdev->bios[sdev->curr].next_free;
> +			sdev->bios[sdev->curr].next_free = -1;
> +			sdev->bios[sdev->curr].count = 0;
> +			spin_unlock_irqrestore(&sdev->list_lock, flags);
> +		} else {
> +			spin_unlock_irqrestore(&sdev->list_lock, flags);
> +			wait_event(sdev->list_wait, sdev->first_free != -1);
> +		}
> +	}
> +	sbio = sdev->bios + sdev->curr;
> +	if (sbio->count == 0) {
> +		sbio->physical = physical;
> +		sbio->logical = logical;
> +	} else if (sbio->physical + sbio->count * PAGE_SIZE != physical) {
> +		scrub_submit(sdev);
> +		goto again;
> +	}
> +	sbio->spag[sbio->count].flags = flags;
> +	sbio->spag[sbio->count].generation = gen;
> +	sbio->spag[sbio->count].have_csum = 0;
> +	sbio->spag[sbio->count].mirror_num = mirror_num;
> +	if (csum) {
> +		sbio->spag[sbio->count].have_csum = 1;
> +		memcpy(sbio->spag[sbio->count].csum, csum, sdev->csum_size);
> +	}
> +	++sbio->count;
> +	if (sbio->count == SCRUB_PAGES_PER_BIO || force)
> +		scrub_submit(sdev);
> +		
> +	return 0;
> +}
> +
> +static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
> +                           u8 *csum)
> +{
> +	struct btrfs_ordered_sum *sum = NULL;
> +	int ret = 0;
> +	unsigned long i;
> +	unsigned long num_sectors;
> +	u32 sectorsize = sdev->dev->dev_root->sectorsize;
> +
> +	while (!list_empty(&sdev->csum_list)) {
> +		sum = list_first_entry(&sdev->csum_list,
> +				       struct btrfs_ordered_sum, list);
> +		if (sum->bytenr > logical)
> +			return 0;
> +		if (sum->bytenr + sum->len > logical)
> +			break;
> +
> +		++sdev->stat.csum_discards;
> +		list_del(&sum->list);
> +		kfree(sum);
> +		sum = NULL;
> +	}
> +	if (!sum)
> +		return 0;
> +
> +	num_sectors = sum->len / sectorsize;
> +	for (i = 0; i < num_sectors; ++i) {
> +		if (sum->sums[i].bytenr == logical) {
> +			memcpy(csum, &sum->sums[i].sum, sdev->csum_size);
> +			ret = 1;
> +			break;
> +		}
> +	}
> +	if (ret && i == num_sectors - 1) {
> +		list_del(&sum->list);
> +		kfree(sum);
> +	}
> +	return ret;
> +}
> +
> +/* scrub extent tries to collect up to 64 kB for each bio */
> +static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
> +                        u64 physical, u64 flags, u64 gen, u64 mirror_num)
> +{
> +	int ret;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +
> +	while(len) {
> +		u64 l = min_t(u64, len, PAGE_SIZE);
> +		int have_csum = 0;
> +
> +		if (flags & BTRFS_EXTENT_FLAG_DATA) {
> +			/* push csums to sbio */
> +			have_csum = scrub_find_csum(sdev, logical, l, csum);
> +			if (have_csum == 0)
> +				++sdev->stat.no_csum;
> +		}
> +		ret = scrub_page(sdev, logical, l, physical, flags, gen,
> +		                 mirror_num, have_csum ? csum : NULL, 0);
> +		if (ret)
> +			return ret;
> +		len -= l;
> +		logical += l;
> +		physical += l;
> +	}
> +	return 0;
> +}
> +
> +static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
> +	struct map_lookup *map, int num, u64 base, u64 length)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> +	struct btrfs_root *root = fs_info->extent_root;
> +	struct btrfs_root *csum_root = fs_info->csum_root;
> +	struct btrfs_extent_item *extent;
> +	u64 flags;
> +	int ret;
> +	int slot;
> +	int i;
> +	int nstripes;
> +	int start_stripe;
> +	struct extent_buffer *l;
> +	struct btrfs_key key;
> +	u64 physical;
> +	u64 logical;
> +	u64 generation;
> +	u64 mirror_num;
> +
> +	u64 increment = map->stripe_len;
> +	u64 offset;
> +
> +	nstripes = length;
> +	offset = 0;
> +	do_div(nstripes, map->stripe_len);
> +	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
> +		offset = map->stripe_len * num;
> +		increment = map->stripe_len * map->num_stripes;
> +		mirror_num = 0;
> +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
> +		int factor = map->num_stripes / map->sub_stripes;
> +		offset = map->stripe_len * (num / map->sub_stripes);
> +		increment = map->stripe_len * factor;
> +		mirror_num = num % map->sub_stripes;
> +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
> +		increment = map->stripe_len;
> +		mirror_num = num % map->num_stripes;
> +	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
> +		increment = map->stripe_len;
> +		mirror_num = num % map->num_stripes;
> +	} else {
> +		increment = map->stripe_len;
> +		mirror_num = 0;
> +	}
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	path->reada = 2;
> +	path->search_commit_root = 1;
> +	path->skip_locking = 1;
> +
> +	/*
> +	 * find all extents for each stripe and just read them to get
> +	 * them into the page cache
> +	 * FIXME: we can do better. build a more intelligent prefetching
> +	 */
> +	logical = base + offset;
> +	physical = map->stripes[num].physical;
> +	ret = 0;
> +	for (i = 0; i < nstripes; ++i) {
> +		key.objectid = logical;
> +		key.type = BTRFS_EXTENT_ITEM_KEY;
> +		key.offset = (u64)0;
> +
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
> +			goto out;
> +
> +		l = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(l, &key, slot);
> +		if (key.objectid != logical) {
> +			ret = btrfs_previous_item(root, path, 0,
> +			                          BTRFS_EXTENT_ITEM_KEY);
> +			if (ret < 0)
> +				goto out;
> +		}
> +
> +		while (1) {
> +			l = path->nodes[0];
> +			slot = path->slots[0];
> +			if (slot >= btrfs_header_nritems(l)) {
> +				ret = btrfs_next_leaf(root, path);
> +				if (ret == 0)
> +					continue;
> +				if (ret < 0)
> +					goto out;
> +
> +				break;
> +			}
> +			btrfs_item_key_to_cpu(l, &key, slot);
> +
> +			if (key.objectid + key.offset <= logical)
> +				goto next1;
> +
> +			if (key.objectid >= logical + map->stripe_len)
> +				break;
> +next1:
> +			path->slots[0]++;
> +		}
> +		btrfs_release_path(root, path);
> +		logical += increment;
> +		physical += map->stripe_len;
> +		cond_resched();
> +	}
> +
> +	/*
> +	 * collect all data csums for the stripe to avoid seeking during
> +	 * the scrub. This might currently (crc32) end up to be about 1MB
> +	 */
> +	start_stripe = 0;
> +again:
> +	logical = base + offset + start_stripe * map->stripe_len;
> +	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
> +	for (i = start_stripe; i < nstripes; ++i) {
> +		ret = btrfs_lookup_csums_range(csum_root, logical,
> +		                               logical + map->stripe_len - 1,
> +		                               &sdev->csum_list, 1);
> +		if (ret)
> +			goto out;
> +
> +		logical += increment;
> +		cond_resched();
> +	}
> +	/*
> +	 * now find all extents for each stripe and scrub them
> +	 */
> +	logical = base + offset + start_stripe * map->stripe_len;
> +	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
> +	ret = 0;
> +	for (i = start_stripe; i < nstripes; ++i) {
> +		/*
> +		 * canceled?
> +		 */
> +		if (atomic_read(&fs_info->scrub_cancel_req) ||
> +		    atomic_read(&sdev->cancel_req)) {
> +			ret = -ECANCELED;
> +			goto out;
> +		}
> +		/*
> +		 * check to see if we have to pause
> +		 */
> +		if (atomic_read(&fs_info->scrub_pause_req)) {
> +			/* push queued extents */
> +			scrub_submit(sdev);
> +			wait_event(sdev->list_wait,
> +			           atomic_read(&sdev->in_flight) == 0);
> +			atomic_inc(&fs_info->scrubs_paused);
> +			wake_up(&fs_info->scrub_pause_wait);
> +			mutex_lock(&fs_info->scrub_lock);
> +			while(atomic_read(&fs_info->scrub_pause_req)) {
> +				mutex_unlock(&fs_info->scrub_lock);
> +				wait_event(fs_info->scrub_pause_wait,
> +				   atomic_read(&fs_info->scrub_pause_req) == 0);
> +				mutex_lock(&fs_info->scrub_lock);
> +			}
> +			atomic_dec(&fs_info->scrubs_paused);
> +			mutex_unlock(&fs_info->scrub_lock);
> +			wake_up(&fs_info->scrub_pause_wait);
> +			scrub_free_csums(sdev);
> +			goto again;
> +		}
> +
> +		key.objectid = logical;
> +		key.type = BTRFS_EXTENT_ITEM_KEY;
> +		key.offset = (u64)0;
> +
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
> +			goto out;
> +
> +		l = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(l, &key, slot);
> +		if (key.objectid != logical) {
> +			ret = btrfs_previous_item(root, path, 0,
> +			                          BTRFS_EXTENT_ITEM_KEY);
> +			if (ret < 0)
> +				goto out;
> +		}
> +
> +		while (1) {
> +			l = path->nodes[0];
> +			slot = path->slots[0];
> +			if (slot >= btrfs_header_nritems(l)) {
> +				ret = btrfs_next_leaf(root, path);
> +				if (ret == 0)
> +					continue;
> +				if (ret < 0)
> +					goto out;
> +
> +				break;
> +			}
> +			btrfs_item_key_to_cpu(l, &key, slot);
> +
> +			if (key.objectid + key.offset <= logical)
> +				goto next;
> +
> +			if (key.objectid >= logical + map->stripe_len)
> +				break;
> +
> +			if (btrfs_key_type(&key) != BTRFS_EXTENT_ITEM_KEY)
> +				goto next;
> +
> +			extent = btrfs_item_ptr(l, slot,
> +			                        struct btrfs_extent_item);
> +			flags = btrfs_extent_flags(l, extent);
> +			generation = btrfs_extent_generation(l, extent);
> +
> +			if (key.objectid < logical &&
> +			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
> +				printk(KERN_ERR
> +				       "btrfs scrub: tree block %lld spanning "
> +				       "stripes, ignored. logical=%lld\n",
> +				       key.objectid, logical);
> +				goto next;
> +			}
> +
> +			/*
> +			 * trim extent to this stripe
> +			 */
> +			if (key.objectid < logical) {
> +				key.offset -= logical - key.objectid;
> +				key.objectid = logical;
> +			}
> +			if (key.objectid + key.offset >
> +			    logical + map->stripe_len) {
> +				key.offset = logical + map->stripe_len -
> +				             key.objectid;
> +			}
> +
> +			ret = scrub_extent(sdev, key.objectid, key.offset,
> +			                   key.objectid - logical + physical,
> +			                   flags, generation, mirror_num);
> +			if (ret)
> +				goto out;
> +next:
> +			path->slots[0]++;
> +		}
> +		btrfs_release_path(root, path);
> +		logical += increment;
> +		physical += map->stripe_len;
> +		spin_lock(&sdev->stat_lock);
> +		sdev->stat.last_physical = physical;
> +		spin_unlock(&sdev->stat_lock);
> +	}
> +	/* push queued extents */
> +	scrub_submit(sdev);
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret < 0 ? ret : 0;
> +}
> +
> +static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev, 
> +	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length)
> +{
> +	struct btrfs_mapping_tree *map_tree =
> +		&sdev->dev->dev_root->fs_info->mapping_tree;
> +	struct map_lookup *map;
> +	struct extent_map *em;
> +	int i;
> +	int ret;
> +
> +	read_lock(&map_tree->map_tree.lock);
> +	em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
> +	read_unlock(&map_tree->map_tree.lock);
> +
> +	if (!em)
> +		return -EINVAL;
> +
> +	map = (struct map_lookup *)em->bdev;
> +	if (em->start != chunk_offset)
> +		return -EINVAL;
> +
> +	if (em->len < length)
> +		return -EINVAL;
> +
> +	for (i = 0; i < map->num_stripes; ++i) {
> +		if (map->stripes[i].dev == sdev->dev) {
> +			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static noinline_for_stack
> +int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
> +{
> +	struct btrfs_dev_extent *dev_extent = NULL;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = sdev->dev->dev_root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	u64 length;
> +	u64 chunk_tree;
> +	u64 chunk_objectid;
> +	u64 chunk_offset;
> +	int ret;
> +	int slot;
> +	struct extent_buffer *l;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct btrfs_block_group_cache *cache;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	path->reada = 2;
> +	path->search_commit_root = 1;
> +	path->skip_locking = 1;
> +
> +	key.objectid = sdev->dev->devid;
> +	key.offset = 0ull;
> +	key.type = BTRFS_DEV_EXTENT_KEY;
> +
> +
> +	while (1) {
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
> +			goto out;
> +		ret = 0;
> +
> +		l = path->nodes[0];
> +		slot = path->slots[0];
> +
> +		btrfs_item_key_to_cpu(l, &found_key, slot);
> +
> +		if (found_key.objectid != sdev->dev->devid)
> +			break;
> +
> +		if (btrfs_key_type(&key) != BTRFS_DEV_EXTENT_KEY)
> +			break;
> +
> +		if (found_key.offset >= end)
> +			break;
> +
> +		if (found_key.offset < key.offset)
> +			break;
> +
> +		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
> +		length = btrfs_dev_extent_length(l, dev_extent);
> +
> +		if (found_key.offset + length <= start) {
> +			key.offset = found_key.offset + length;
> +			btrfs_release_path(root, path);
> +			continue;
> +		}
> +
> +		chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
> +		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
> +		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
> +
> +		/*
> +		 * get a reference on the corresponding block group to prevent
> +		 * the chunk from going away while we scrub it
> +		 */
> +		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
> +		if (!cache) {
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +		ret = scrub_chunk(sdev, chunk_tree, chunk_objectid,
> +		                  chunk_offset, length);
> +		btrfs_put_block_group(cache);
> +		if (ret)
> +			break;
> +
> +		key.offset = found_key.offset + length;
> +		btrfs_release_path(root, path);
> +	}
> +
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
> +{
> +	int	i;
> +	u64	bytenr;
> +	u64	gen;
> +	int	ret;
> +	struct btrfs_device *device = sdev->dev;
> +	struct btrfs_root *root = device->dev_root;
> +
> +	gen = root->fs_info->last_trans_committed;
> +
> +	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
> +		bytenr = btrfs_sb_offset(i);
> +		if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->total_bytes)
> +			break;
> +
> +		ret = scrub_page(sdev, bytenr, PAGE_SIZE, bytenr, 
> +		                 BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
> +		if (ret)
> +			return ret;
> +	}
> +	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
> +
> +	return 0;
> +}
> +
> +/*
> + * get a reference count on fs_info->scrub_workers. start worker if necessary
> + */
> +static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	if (fs_info->scrub_workers_refcnt == 0) {
> +#ifdef SCRUB_BTRFS_WORKER
> +		btrfs_start_workers(&fs_info->scrub_workers, 1);
> +#else
> +		fs_info->scrub_workers = create_workqueue("scrub");
> +		if (!fs_info->scrub_workers) {
> +			mutex_unlock(&fs_info->scrub_lock);
> +			return -ENOMEM;
> +		}
> +#endif
> +	}
> +	++fs_info->scrub_workers_refcnt;
> +	mutex_unlock(&fs_info->scrub_lock);
> +
> +	return 0;
> +}
> +
> +static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)

This func is always called immediately after a mutex_unlock(scrub_lock),
and then takes the lock again. I suggest dropping the locking here and adjusting
all callsites.

Same applies for scrub_workers_get()
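
I.e. roughly (untested sketch, SCRUB_BTRFS_WORKER variant only; the callers in
btrfs_scrub_dev() would then take scrub_lock around these calls):

	/* caller must hold fs_info->scrub_lock */
	static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
	{
		struct btrfs_fs_info *fs_info = root->fs_info;

		if (fs_info->scrub_workers_refcnt == 0)
			btrfs_start_workers(&fs_info->scrub_workers, 1);
		++fs_info->scrub_workers_refcnt;

		return 0;
	}

	/* caller must hold fs_info->scrub_lock */
	static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
	{
		struct btrfs_fs_info *fs_info = root->fs_info;

		if (--fs_info->scrub_workers_refcnt == 0)
			btrfs_stop_workers(&fs_info->scrub_workers);
		WARN_ON(fs_info->scrub_workers_refcnt < 0);
	}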

> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	
> +	mutex_lock(&fs_info->scrub_lock);
> +	if (--fs_info->scrub_workers_refcnt == 0) {
> +#ifdef SCRUB_BTRFS_WORKER
> +		btrfs_stop_workers(&fs_info->scrub_workers);
> +#else
> +		destroy_workqueue(fs_info->scrub_workers);
> +		fs_info->scrub_workers = NULL;
> +#endif
> +
> +	}
> +	WARN_ON(fs_info->scrub_workers_refcnt < 0);
> +	mutex_unlock(&fs_info->scrub_lock);
> +}
> +
> +
> +int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
> +                    struct btrfs_scrub_progress *progress)
> +{
> +	struct scrub_dev *sdev;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	int ret;
> +	struct btrfs_device *dev;
> +
> +	if (root->fs_info->closing)
> +		return -EINVAL;
> +
> +	/*
> +	 * check some assumptions
> +	 */
> +	if (root->sectorsize != PAGE_SIZE ||
> +	    root->sectorsize != root->leafsize ||
> +	    root->sectorsize != root->nodesize) {
> +		printk(KERN_ERR "btrfs_scrub: size assumptions fail\n");
> +		return -EINVAL;
> +	}
> +	    
> +	ret = scrub_workers_get(root);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> +	dev = btrfs_find_device(root, devid, NULL, NULL);
> +	if (!dev || dev->missing) {
> +		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +		scrub_workers_put(root);
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	if (dev->scrub_device) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		scrub_workers_put(root);
> +		return -EINPROGRESS;
> +	}
> +	sdev = scrub_setup_dev(dev);
> +	if (IS_ERR(sdev)) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		scrub_workers_put(root);
> +		return PTR_ERR(sdev);
> +	}
> +	dev->scrub_device = sdev;
> +
> +	atomic_inc(&fs_info->scrubs_running);
> +	mutex_unlock(&fs_info->scrub_lock);
> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +
> +	down_read(&fs_info->scrub_super_lock);
> +	ret = scrub_supers(sdev);
> +	up_read(&fs_info->scrub_super_lock);
> +
> +	if (!ret)
> +		ret = scrub_enumerate_chunks(sdev, start, end);
> +
> +	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	atomic_dec(&fs_info->scrubs_running);
> +	mutex_unlock(&fs_info->scrub_lock);
> +	wake_up(&fs_info->scrub_pause_wait);
> +
> +	if (progress)
> +		memcpy(progress, &sdev->stat, sizeof(*progress));
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	dev->scrub_device = NULL;
> +	mutex_unlock(&fs_info->scrub_lock);
> +
> +	scrub_free_dev(sdev);
> +	scrub_workers_put(root);
> +
> +	return ret;
> +}
> +
> +int btrfs_scrub_pause(struct btrfs_root *root)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	mutex_lock(&fs_info->scrub_lock);
> +	atomic_inc(&fs_info->scrub_pause_req);
> +	while (atomic_read(&fs_info->scrubs_paused) !=
> +	       atomic_read(&fs_info->scrubs_running)) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		wait_event(fs_info->scrub_pause_wait,
> +			   atomic_read(&fs_info->scrubs_paused) ==
> +			   atomic_read(&fs_info->scrubs_running));
> +		mutex_lock(&fs_info->scrub_lock);
> +	}
> +	mutex_unlock(&fs_info->scrub_lock);
> +
> +	return 0;
> +}
> +
> +int btrfs_scrub_continue(struct btrfs_root *root)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +
> +	atomic_dec(&fs_info->scrub_pause_req);
> +	wake_up(&fs_info->scrub_pause_wait);
> +	return 0;
> +}
> +
> +int btrfs_scrub_pause_super(struct btrfs_root *root)
> +{
> +	down_write(&root->fs_info->scrub_super_lock);
> +	return 0;
> +}
> +
> +int btrfs_scrub_continue_super(struct btrfs_root *root)
> +{
> +	up_write(&root->fs_info->scrub_super_lock);
> +	return 0;
> +}
> +
> +int btrfs_scrub_cancel(struct btrfs_root *root)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	mutex_lock(&fs_info->scrub_lock);
> +	if (!atomic_read(&fs_info->scrubs_running)) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		return -ENOTCONN;
> +	}
> +
> +	atomic_inc(&fs_info->scrub_cancel_req);
> +	while(atomic_read(&fs_info->scrubs_running)) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		wait_event(fs_info->scrub_pause_wait,
> +			   atomic_read(&fs_info->scrubs_running) == 0);
> +		mutex_lock(&fs_info->scrub_lock);
> +	}
> +	atomic_dec(&fs_info->scrub_cancel_req);
> +	mutex_unlock(&fs_info->scrub_lock);
> +	
> +	return 0;
> +}
> +
> +int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct scrub_dev *sdev;
> +
> +	mutex_lock(&fs_info->scrub_lock);
> +	sdev = dev->scrub_device;
> +	if (!sdev) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		return -ENOTCONN;
> +	}
> +	atomic_inc(&sdev->cancel_req);
> +	while(dev->scrub_device) {
> +		mutex_unlock(&fs_info->scrub_lock);
> +		wait_event(fs_info->scrub_pause_wait,
> +		           dev->scrub_device == NULL);
> +		mutex_lock(&fs_info->scrub_lock);
> +	}
> +	mutex_unlock(&fs_info->scrub_lock);
> +		
> +	return 0;
> +}
> +int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid)
> +{
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_device *dev;
> +	int ret;
> +
> +	/*
> +	 * we have to hold the device_list_mutex here so the device
> +	 * does not go away in cancel_dev. FIXME: find a better solution
> +	 */
> +	mutex_lock(&fs_info->fs_devices->device_list_mutex);
> +	dev = btrfs_find_device(root, devid, NULL, NULL);
> +	if (!dev) {
> +		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +		return -ENODEV;
> +	}
> +	ret = btrfs_scrub_cancel_dev(root, dev);
> +	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +
> +	return ret;
> +}
> +	
> +int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
> +                         struct btrfs_scrub_progress *progress)
> +{
> +	struct btrfs_device *dev;
> +	struct scrub_dev *sdev = NULL;
> +
> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> +	dev = btrfs_find_device(root, devid, NULL, NULL);
> +	if (dev)
> +		sdev = dev->scrub_device;
> +	if (sdev)
> +		memcpy(progress, &sdev->stat, sizeof(*progress));
> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +
> +	return dev ? (sdev ? 0 : -ENOTCONN) : -ENODEV;
> +}
> -- 
> 1.7.3.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/6] btrfs: add state information for scrub
  2011-03-11 14:49 ` [PATCH v2 5/6] btrfs: add state information for scrub Arne Jansen
@ 2011-03-11 16:53   ` David Sterba
  2011-03-12 13:13     ` Arne Jansen
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2011-03-11 16:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sensille

On Fri, Mar 11, 2011 at 03:49:42PM +0100, Arne Jansen wrote:
> Add structures and state information needed for scrub
> 
> Signed-off-by: Arne Jansen <sensille@gmx.net>
> ---
>  fs/btrfs/ctree.h   |   26 ++++++++++++++++++++++++++
>  fs/btrfs/disk-io.c |   15 +++++++++++++++
>  fs/btrfs/ioctl.h   |   17 +++++++++++++++++
>  fs/btrfs/volumes.h |    3 +++
>  4 files changed, 61 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 030c321..3584179 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -23,6 +23,7 @@
>  #include <linux/mm.h>
>  #include <linux/highmem.h>
>  #include <linux/fs.h>
> +#include <linux/rwsem.h>
>  #include <linux/completion.h>
>  #include <linux/backing-dev.h>
>  #include <linux/wait.h>
> @@ -32,6 +33,7 @@
>  #include "extent_io.h"
>  #include "extent_map.h"
>  #include "async-thread.h"
> +#include "ioctl.h"
>  
>  struct btrfs_trans_handle;
>  struct btrfs_transaction;
> @@ -48,6 +50,8 @@ struct btrfs_ordered_sum;
>  
>  #define BTRFS_COMPAT_EXTENT_TREE_V0
>  
> +#define SCRUB_BTRFS_WORKER
> +
>  /*
>   * files bigger than this get some pre-flushing when they are added
>   * to the ordered operations list.  That way we limit the total
> @@ -508,6 +512,12 @@ struct btrfs_extent_item_v0 {
>  /* use full backrefs for extent pointers in the block */
>  #define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)
>  
> +/*
> + * this flag is only used internally by scrub and may be changed at any time
> + * it is only declared here to avoid collisions
> + */
> +#define BTRFS_EXTENT_FLAG_SUPER		(1ULL << 48)
> +
>  struct btrfs_tree_block_info {
>  	struct btrfs_disk_key key;
>  	u8 level;
> @@ -1067,6 +1077,22 @@ struct btrfs_fs_info {
>  
>  	void *bdev_holder;
>  
> +	/* private scrub information */
> +	struct mutex scrub_lock;
> +	struct scrub_info *scrub_info;
                           ^^^^^^^^^^

I did not find any reference to this item

> +	atomic_t scrubs_running;
> +	atomic_t scrub_pause_req;
> +	atomic_t scrubs_paused;
> +	atomic_t scrub_cancel_req;

This makes me think ... you declare atomics and yet lock (nearly) every
variable use, like

+       mutex_lock(&fs_info->scrub_lock);
+       atomic_inc(&fs_info->scrubs_running);
+       mutex_unlock(&fs_info->scrub_lock);

or

+       mutex_lock(&fs_info->scrub_lock);
+       if (!atomic_read(&fs_info->scrubs_running)) {
+               mutex_unlock(&fs_info->scrub_lock);
+               return -ENOTCONN;
+       }

imho this is not needed with atomics. Moreover, the locking is not
consistent; a quick grep for atomic_read shows many statements without
any locks around them.
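
(To make it concrete — a sketch only, not a tested change: with the
counter being an atomic_t, the read should be able to stand on its own,

	if (!atomic_read(&fs_info->scrubs_running))
		return -ENOTCONN;

without scrub_lock around it, as long as nothing in the locked section
relies on the value staying fixed.)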


> +	wait_queue_head_t scrub_pause_wait;
> +	struct rw_semaphore scrub_super_lock;
> +	int scrub_workers_refcnt;

A refcount could be an atomic too ...
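
As a sketch (assuming the field were converted to an atomic_t; whether
the first increment can race with the worker setup is a separate
question):

	/* get */
	if (atomic_inc_return(&fs_info->scrub_workers_refcnt) == 1)
		btrfs_start_workers(&fs_info->scrub_workers, 1);

	/* put */
	if (atomic_dec_and_test(&fs_info->scrub_workers_refcnt))
		btrfs_stop_workers(&fs_info->scrub_workers);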

> +#ifdef SCRUB_BTRFS_WORKER
> +	struct btrfs_workers scrub_workers;
> +#else
> +	struct workqueue_struct *scrub_workers;
> +#endif
> +

Apart from the atomics and scrub_workers_refcnt, there is only
scrub_workers left that needs locking protection, which can be done
under a spinlock.


dave

>  	/* filesystem state */
>  	u64 fs_state;
>  };
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 924a366..4d62bc3 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1677,6 +1677,21 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>  	INIT_LIST_HEAD(&fs_info->ordered_extents);
>  	spin_lock_init(&fs_info->ordered_extent_lock);
>  
> +	mutex_init(&fs_info->scrub_lock);
> +	atomic_set(&fs_info->scrubs_running, 0);
> +	atomic_set(&fs_info->scrub_pause_req, 0);
> +	atomic_set(&fs_info->scrubs_paused, 0);
> +	atomic_set(&fs_info->scrub_cancel_req, 0);
> +	init_waitqueue_head(&fs_info->scrub_pause_wait);
> +	init_rwsem(&fs_info->scrub_super_lock);
> +	fs_info->scrub_workers_refcnt = 0;
> +#ifdef SCRUB_BTRFS_WORKER
> +	btrfs_init_workers(&fs_info->scrub_workers, "scrub",
> +			   fs_info->thread_pool_size, &fs_info->generic_worker);
> +#else
> +	fs_info->scrub_workers = NULL;
> +#endif
> +
>  	sb->s_blocksize = 4096;
>  	sb->s_blocksize_bits = blksize_bits(4096);
>  	sb->s_bdi = &fs_info->bdi;
> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> index 8fb3821..973e7c8 100644
> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> @@ -42,6 +42,23 @@ struct btrfs_ioctl_vol_args_v2 {
>  	char name[BTRFS_SUBVOL_NAME_MAX + 1];
>  };
>  
> +struct btrfs_scrub_progress {
> +	__u64 data_extents_scrubbed;
> +	__u64 tree_extents_scrubbed;
> +	__u64 data_bytes_scrubbed;
> +	__u64 tree_bytes_scrubbed;
> +	__u64 read_errors;
> +	__u64 csum_errors;
> +	__u64 verify_errors;
> +	__u64 no_csum;
> +	__u64 csum_discards;
> +	__u64 super_errors;
> +	__u64 malloc_errors;
> +	__u64 uncorrectable_errors;
> +	__u64 corrected_errors;
> +	__u64 last_physical;
> +};
> +
>  #define BTRFS_INO_LOOKUP_PATH_MAX 4080
>  struct btrfs_ioctl_ino_lookup_args {
>  	__u64 treeid;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 0ccc982..92204d9 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -86,6 +86,9 @@ struct btrfs_device {
>  	/* physical drive uuid (or lvm uuid) */
>  	u8 uuid[BTRFS_UUID_SIZE];
>  
> +	/* per-device scrub information */
> +	struct scrub_dev *scrub_device;
> +
>  	struct btrfs_work work;
>  };
>  
> -- 
> 1.7.3.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes
  2011-03-11 16:34   ` David Sterba
@ 2011-03-12 10:54     ` Arne Jansen
  2011-03-22 16:38       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Arne Jansen @ 2011-03-12 10:54 UTC (permalink / raw)
  To: linux-btrfs, dave

Hi David,

thanks for your reviews. I'll update the code accordingly,
comments follow inline.

--
Arne

David Sterba wrote:
> On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
>> This is the main scrub code.
>>
>> Signed-off-by: Arne Jansen <sensille@gmx.net>
>> ---
>>  fs/btrfs/Makefile |    2 +-
>>  fs/btrfs/ctree.h  |   14 +
>>  fs/btrfs/scrub.c  | 1463 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 1478 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>> index 31610ea..8fda313 100644
>> --- a/fs/btrfs/Makefile
>> +++ b/fs/btrfs/Makefile
>> @@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>  	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
>>  	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
>>  	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
>> -	   compression.o delayed-ref.o relocation.o
>> +	   compression.o delayed-ref.o relocation.o scrub.o
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 4c99834..030c321 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2610,4 +2610,18 @@ void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
>>  			      u64 *bytes_to_reserve);
>>  void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>>  			      struct btrfs_pending_snapshot *pending);
>> +
>> +/* scrub.c */
>> +int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
>> +                    struct btrfs_scrub_progress *progress);
>> +int btrfs_scrub_pause(struct btrfs_root *root);
>> +int btrfs_scrub_pause_super(struct btrfs_root *root);
>> +int btrfs_scrub_continue(struct btrfs_root *root);
>> +int btrfs_scrub_continue_super(struct btrfs_root *root);
>> +int btrfs_scrub_cancel(struct btrfs_root *root);
>> +int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
>> +int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
>> +int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
>> +                         struct btrfs_scrub_progress *progress);
>> +
>>  #endif
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> new file mode 100644
>> index 0000000..d606f4d
>> --- /dev/null
>> +++ b/fs/btrfs/scrub.c
>> @@ -0,0 +1,1463 @@
>> +/*
>> + * Copyright (C) 2011 STRATO.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; if not, write to the
>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>> + * Boston, MA 021110-1307, USA.
>> + */
>> +
>> +#include <linux/sched.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/writeback.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/rbtree.h>
>> +#include <linux/slab.h>
>> +#include <linux/workqueue.h>
>> +#include "ctree.h"
>> +#include "volumes.h"
>> +#include "disk-io.h"
>> +#include "ordered-data.h"
>> +
>> +/*
>> + * This is only the first step towards a full-features scrub. It reads all
>> + * extent and super block and verifies the checksums. In case a bad checksum
>> + * is found or the extent cannot be read, good data will be written back if
>> + * any can be found.
>> + *
>> + * Future enhancements:
>> + *  - To enhance the performance, better read-ahead strategies for the
>> + *    extent-tree can be employed.
>> + *  - In case an unrepairable extent is encountered, track which files are
>> + *    affected and report them
>> + *  - In case of a read error on files with nodatasum, map the file and read
>> + *    the extent to trigger a writeback of the good copy
>> + *  - track and record media errors, throw out bad devices
>> + *  - add a readonly mode
>> + *  - add a mode to also read unallocated space
>> + */
>> +
>> +#ifdef SCRUB_BTRFS_WORKER
>> +typedef struct btrfs_work scrub_work_t;
>> +#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
>> +#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker(&(wq), w); } while (0)
>> +#else
>> +typedef struct work_struct scrub_work_t;
>> +#define SCRUB_INIT_WORK INIT_WORK
>> +#define SCRUB_QUEUE_WORK queue_work
>> +#endif
>> +
>> +struct scrub_bio;
>> +struct scrub_page;
>> +struct scrub_dev;
>> +struct scrub_fixup;
>> +static void scrub_bio_end_io(struct bio *bio, int err);
>> +static void scrub_checksum(scrub_work_t *work);
>> +static int scrub_checksum_data(struct scrub_dev *sdev,
>> +                               struct scrub_page *spag, void *buffer);
>> +static int scrub_checksum_tree_block(struct scrub_dev *sdev,
>> +                                     struct scrub_page *spag, u64 logical,
>> +                                     void *buffer);
>> +static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
>> +static void scrub_recheck_end_io(struct bio *bio, int err);
>> +static void scrub_fixup_worker(scrub_work_t *work);
>> +static void scrub_fixup(struct scrub_fixup *fixup);
>> +
>> +#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
>> +#define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
>> +
>> +struct scrub_page {
>> +	u64			flags;  /* extent flags */
>> +	u64			generation;
>> +	u64			mirror_num;
>> +	int			have_csum;
>> +	u8			csum[BTRFS_CSUM_SIZE];
>> +};
>> +
>> +struct scrub_bio {
>> +	int			index;
>> +	struct scrub_dev	*sdev;
>> +	struct bio		*bio;
>> +	int			err;
>> +	u64			logical;
>> +	u64			physical;
>> +	struct scrub_page	spag[SCRUB_PAGES_PER_BIO];
>> +	u64			count;
>> +	int			next_free;
>> +	scrub_work_t		work;
>> +};
>> +
>> +struct scrub_dev {
>> +	struct scrub_bio	bios[SCRUB_BIOS_PER_DEV];
> 
> sizeof(struct scrub_bio) == 1160
> SCRUB_BIOS_PER_DEV == 16
> 
>> +	struct btrfs_device	*dev;
>> +	int			first_free;
>> +	int			curr;
>> +	atomic_t		in_flight;
>> +	spinlock_t		list_lock;
>> +	wait_queue_head_t	list_wait;
>> +	u16			csum_size;
>> +	struct list_head	csum_list;
>> +	atomic_t		cancel_req;
>> +	/*
>> +	 * statistics
>> +	 */
>> +	struct btrfs_scrub_progress stat;
>> +	spinlock_t		stat_lock;
>> +};
> 
> sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3 allocation in
> scrub_setup_dev()

Is this a problem? There are only a few allocations of it, one per device.
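
If it ever turns out to be, one option (just a sketch, not part of this
patch) would be to turn the embedded array into an array of pointers,
struct scrub_bio *bios[SCRUB_BIOS_PER_DEV], and allocate each one
separately so every allocation stays order 0:

	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
		sdev->bios[i] = kzalloc(sizeof(*sdev->bios[i]), GFP_NOFS);
		if (!sdev->bios[i])
			goto nomem;
	}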

> 
>> +
>> +struct scrub_fixup {
>> +	struct scrub_dev	*sdev;
>> +	struct bio		*bio;
>> +	u64			logical;
>> +	u64			physical;
>> +	struct scrub_page	spag;
>> +	scrub_work_t		work;
>> +	int			err;
>> +	int			recheck;
>> +};
>> +
>> +static void scrub_free_csums(struct scrub_dev *sdev)
>> +{
>> +	while(!list_empty(&sdev->csum_list)) {
>> +		struct btrfs_ordered_sum *sum;
>> +		sum = list_first_entry(&sdev->csum_list,
>> +		                       struct btrfs_ordered_sum, list);
>> +		list_del(&sum->list);
>> +		kfree(sum);
>> +	}
>> +}
>> +
>> +static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
>> +{
>> +	int i;
>> +	int j;
>> +	struct page *last_page;
>> +
>> +	if (!sdev)
>> +		return;
>> +
>> +	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
>> +		struct bio *bio = sdev->bios[i].bio;
>> +		if (bio)
>                    ^^^^^
> stop when we found something to free?
> 

right, good catch. It's obviously the wrong way round.
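
The freeing loop should presumably look more like this (sketch of the
intended fix; note the inner loop also has to index the io_vec with j,
not i):

	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
		struct bio *bio = sdev->bios[i].bio;
		if (!bio)
			break;

		last_page = NULL;
		for (j = 0; j < bio->bi_vcnt; ++j) {
			if (bio->bi_io_vec[j].bv_page == last_page)
				continue;
			last_page = bio->bi_io_vec[j].bv_page;
			__free_page(last_page);
		}
		bio_put(bio);
	}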

> 
>> +			break;
>> +		
>> +		last_page = NULL;
>> +		for (j = 0; j < bio->bi_vcnt; ++j) {
>                                 ^^^
> and dereference it.
> 
>> +			if (bio->bi_io_vec[i].bv_page == last_page)
>> +				continue;
>> +			last_page = bio->bi_io_vec[i].bv_page;
>> +			__free_page(last_page);
>> +		}
>> +		bio_put(sdev->bios[i].bio);
>> +	}
>> +
>> +	scrub_free_csums(sdev);
>> +	kfree(sdev);
>> +}
>> +
>> +static noinline_for_stack
>> +struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
>> +{
>> +	struct scrub_dev *sdev;
>> +	int		i;
>> +	int		j;
>> +	int		ret;
>> +	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
> 
> (coding style expects a newline here)

coding style issues are always the gravest. Hope you never catch me with
a line > 80 columns ;)

> 
>> +	sdev = kzalloc(sizeof(*sdev), GFP_NOFS);
>> +	if (!sdev)
>> +		goto nomem;
>> +	sdev->dev = dev;
>> +	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
>> +		struct bio *bio;
>> +
>> +		bio = bio_alloc(GFP_NOFS, SCRUB_PAGES_PER_BIO);
>> +		if (!bio)
>> +			goto nomem;
>> +
>> +		sdev->bios[i].index = i;
>> +		sdev->bios[i].sdev = sdev;
>> +		sdev->bios[i].bio = bio;
>> +		sdev->bios[i].count = 0;
>> +		SCRUB_INIT_WORK(&sdev->bios[i].work, scrub_checksum);
>> +		bio->bi_private = sdev->bios + i;
>> +		bio->bi_end_io = scrub_bio_end_io;
>> +		bio->bi_sector = 0;
>> +		bio->bi_bdev = dev->bdev;
>> +		bio->bi_size = 0;
>> +
>> +		for (j = 0; j < SCRUB_PAGES_PER_BIO; ++j) {
>> +			struct page *page;
>> +			page = alloc_page(GFP_NOFS);
>> +			if (!page)
>> +				goto nomem;
>> +
>> +			ret = bio_add_page(bio, page, PAGE_SIZE, 0);
>> +			if (!ret)
>> +				goto nomem;
>> +		}
>> +		WARN_ON(bio->bi_vcnt != SCRUB_PAGES_PER_BIO);
>> +
>> +		if (i != SCRUB_BIOS_PER_DEV-1)
>> +			sdev->bios[i].next_free = i + 1;
>> +		 else
>> +			sdev->bios[i].next_free = -1;
>> +	}
>> +	sdev->first_free = 0;
>> +	sdev->curr = -1;
>> +	atomic_set(&sdev->in_flight, 0);
>> +	atomic_set(&sdev->cancel_req, 0);
>> +	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
>> +	INIT_LIST_HEAD(&sdev->csum_list);
>> +	
>> +	spin_lock_init(&sdev->list_lock);
>> +	spin_lock_init(&sdev->stat_lock);
>> +	init_waitqueue_head(&sdev->list_wait);
>> +	return sdev;
>> +
>> +nomem:
>> +	scrub_free_dev(sdev);
> 
> When taking the 'goto nomem' path, either all bios are leaked, or the
> check in scrub_free_dev is buggy ...
> 
>> +	return ERR_PTR(-ENOMEM);
>> +}
>> +
>> +/*
>> + * scrub_recheck_error gets called when either verification of the page
>> + * failed or the bio failed to read, e.g. with EIO. In the latter case,
>> + * recheck_error gets called for every page in the bio, even though only
>> + * one may be bad
>> + */
>> +static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
>> +{
>> +	struct scrub_dev *sdev = sbio->sdev;
>> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
>> +	struct bio *bio = NULL;
>> +	struct page *page = NULL;
>> +	struct scrub_fixup *fixup = NULL;
>> +	int ret;
>> +
>> +	/*
>> +	 * while we're in here we do not want the transaction to commit.
>> +	 * To prevent it, we increment scrubs_running. scrub_pause will
>> +	 * have to wait until we're finished
>> +	 */
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	atomic_inc(&fs_info->scrubs_running);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +
>> +	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
>> +	if (!fixup)
>> +		goto malloc_error;
>> +
>> +	fixup->logical = sbio->logical + ix * PAGE_SIZE;
>> +	fixup->physical = sbio->physical + ix * PAGE_SIZE;
>> +	fixup->spag = sbio->spag[ix];
>> +	fixup->sdev = sdev;
>> +
>> +	bio = bio_alloc(GFP_NOFS, 1);
>> +	if (!bio)
>> +		goto malloc_error;
>> +	bio->bi_private = fixup;
>> +	bio->bi_size = 0;
>> +	bio->bi_bdev = sdev->dev->bdev;	/* FIXME: temporary for add_page */
>> +	fixup->bio = bio;
>> +	fixup->recheck = 0;
>> +
>> +	page = alloc_page(GFP_NOFS);
>> +	if (!page)
>> +		goto malloc_error;
>> +
>> +	ret = bio_add_page(bio, page, PAGE_SIZE, 0);
>> +	if (!ret)
>> +		goto malloc_error;
>> +
>> +	if (!sbio->err) {
>> +		/*
>> +		 * shorter path: just a checksum error, go ahead and correct it
>> +		 */
>> +		scrub_fixup_worker(&fixup->work);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * an I/O-error occured for one of the blocks in the bio, not
>> +	 * necessarily for this one, so first try to read it separately
>> +	 */
>> +	SCRUB_INIT_WORK(&fixup->work, scrub_fixup_worker);
>> +	fixup->recheck = 1;
>> +	bio->bi_end_io = scrub_recheck_end_io;
>> +	bio->bi_sector = fixup->physical >> 9;
>> +	bio->bi_bdev = sdev->dev->bdev;
>> +	submit_bio(0, bio);
>> +
>> +	return;
>> +
>> +malloc_error:
>> +	if (bio) 
>> +		bio_put(bio);
>> +	if (page)
>> +		__free_page(page);
>> +	if (fixup)
>> +		kfree(fixup);
>> +	spin_lock(&sdev->stat_lock);
>> +	++sdev->stat.malloc_errors;
>> +	spin_unlock(&sdev->stat_lock);
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	atomic_dec(&fs_info->scrubs_running);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +	wake_up(&fs_info->scrub_pause_wait);
>> +}
>> +
>> +static void scrub_recheck_end_io(struct bio *bio, int err)
>> +{
>> +	struct scrub_fixup *fixup = bio->bi_private;
>> +	struct btrfs_fs_info *fs_info = fixup->sdev->dev->dev_root->fs_info;
>> +
>> +	fixup->err = err;
>> +	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &fixup->work);
>> +}
>> +
>> +static int scrub_fixup_check(struct scrub_fixup *fixup)
>> +{
>> +	int ret = 1;
>> +	struct page *page;
>> +	void *buffer;
>> +	u64 flags = fixup->spag.flags;
>> +
>> +	page = fixup->bio->bi_io_vec[0].bv_page;
>> +	buffer = kmap_atomic(page, KM_USER0);
>> +	if (flags & BTRFS_EXTENT_FLAG_DATA) {
>> +		ret = scrub_checksum_data(fixup->sdev,
>> +					  &fixup->spag, buffer);
>> +	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
>> +		ret = scrub_checksum_tree_block(fixup->sdev,
>> +						&fixup->spag,
>> +						fixup->logical,
>> +						buffer);
>> +	} else {
>> +		WARN_ON(1);
>> +	}
>> +	kunmap_atomic(buffer, KM_USER0);
>> +
>> +	return ret;
>> +}
>> +
>> +static void scrub_fixup_worker(scrub_work_t *work)
>> +{
>> +	struct scrub_fixup *fixup;
>> +	struct btrfs_fs_info *fs_info;
>> +	u64 flags;
>> +	int ret = 1;
>> +
>> +	fixup = container_of(work, struct scrub_fixup, work);
>> +	fs_info = fixup->sdev->dev->dev_root->fs_info;
>> +	flags = fixup->spag.flags;
>> +
>> +	if (fixup->recheck && fixup->err == 0)
>> +		ret = scrub_fixup_check(fixup);
>> +
>> +	if (ret || fixup->err)
>> +		scrub_fixup(fixup);
>> +
>> +	__free_page(fixup->bio->bi_io_vec[0].bv_page);
>> +	bio_put(fixup->bio);
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	atomic_dec(&fs_info->scrubs_running);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +	wake_up(&fs_info->scrub_pause_wait);
>> +
>> +	kfree(fixup);
>> +}
>> +
>> +static void scrub_fixup_end_io(struct bio *bio, int err)
>> +{
>> +	complete((struct completion *)bio->bi_private);
>> +}
>> +
>> +static void scrub_fixup(struct scrub_fixup *fixup)
>> +{
>> +	struct scrub_dev *sdev = fixup->sdev;
>> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
>> +	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree; 
>> +	struct btrfs_multi_bio *multi = NULL;
>> +	struct bio *bio = fixup->bio;
>> +	u64 length;
>> +	int i;
>> +	int ret;
>> +	DECLARE_COMPLETION_ONSTACK(complete);
>> +
>> +	if ((fixup->spag.flags & BTRFS_EXTENT_FLAG_DATA) &&
>> +	    (fixup->spag.have_csum == 0)) {
>> +		/*
>> +		 * nodatasum, don't try to fix anything
>> +		 * FIXME: we can do better, open the inode and trigger a
>> +		 * writeback
>> +		 */
>> +		goto uncorrectable;
>> +	}
>> +
>> +	length = PAGE_SIZE;
>> +	ret = btrfs_map_block(map_tree, REQ_WRITE, fixup->logical, &length,
>> +	                      &multi, 0);
>> +	if (ret || !multi || length < PAGE_SIZE) {
>> +		printk(KERN_ERR
>> +		       "scrub_fixup: btrfs_map_block failed us for %lld\n",
>> +		       fixup->logical);
>> +		WARN_ON(1);
>> +		return;
>> +	}
>> +
>> +	if (multi->num_stripes == 1) {
>> +		/* there aren't any replicas */
>> +		goto uncorrectable;
>> +	}
>> +
>> +	/*
>> +	 * first find a good copy
>> +	 */
>> +	for (i = 0; i < multi->num_stripes; ++i) {
>> +		if (i == fixup->spag.mirror_num)
>> +			continue;
>> +
>> +		bio->bi_sector = multi->stripes[i].physical >> 9;
>> +		bio->bi_bdev = multi->stripes[i].dev->bdev;
>> +		bio->bi_size = PAGE_SIZE;
>> +		bio->bi_next = NULL;
>> +		bio->bi_flags = 1 << BIO_UPTODATE;
>> +		bio->bi_comp_cpu = -1;
>> +		bio->bi_end_io = scrub_fixup_end_io;
>> +		bio->bi_private = &complete;
>> +
>> +		submit_bio(0, bio);
>> +
>> +		wait_for_completion(&complete);
>> +
>> +		if (~bio->bi_flags & BIO_UPTODATE)
>> +			/* I/O-error, this is not a good copy */
>> +			continue;
>> +
>> +		ret = scrub_fixup_check(fixup);
>> +		if (ret == 0)
>> +			break;
>> +	}
>> +	if (i == multi->num_stripes)
>> +		goto uncorrectable;
>> +
>> +	/*
>> +	 * the bio now contains good data, write it back
>> +	 */
>> +	bio->bi_sector = fixup->physical >> 9;
>> +	bio->bi_bdev = sdev->dev->bdev;
>> +	bio->bi_size = PAGE_SIZE;
>> +	bio->bi_next = NULL;
>> +	bio->bi_flags = 1 << BIO_UPTODATE;
>> +	bio->bi_comp_cpu = -1;
>> +	bio->bi_end_io = scrub_fixup_end_io;
>> +	bio->bi_private = &complete;
>> +
>> +	submit_bio(REQ_WRITE, bio);
>> +
>> +	wait_for_completion(&complete);
>> +
>> +	if (~bio->bi_flags & BIO_UPTODATE)
>> +		/* I/O-error, writeback failed, give up */
>> +		goto uncorrectable;
>> +
>> +	kfree(multi);
>> +	spin_lock(&sdev->stat_lock);
>> +	++sdev->stat.corrected_errors;
>> +	spin_unlock(&sdev->stat_lock);
>> +
>> +	if (printk_ratelimit())
>> +		printk(KERN_ERR "btrfs: fixed up at %lld\n", fixup->logical);
>> +	return;
>> +
>> +uncorrectable:
>> +	kfree(multi);
>> +	spin_lock(&sdev->stat_lock);
>> +	++sdev->stat.uncorrectable_errors;
>> +	spin_unlock(&sdev->stat_lock);
>> +
>> +	if (printk_ratelimit())
>> +		printk(KERN_ERR "btrfs: unable to fixup at %lld\n",
>> +			 fixup->logical);
>> +}
>> +
>> +static void scrub_bio_end_io(struct bio *bio, int err)
>> +{
>> +	struct scrub_bio *sbio = bio->bi_private;
>> +	struct scrub_dev *sdev = sbio->sdev;
>> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
>> +
>> +	sbio->err = err;
>> +
>> +	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &sbio->work);
>> +}
>> +
>> +static void scrub_checksum(scrub_work_t *work)
>> +{
>> +	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
>> +	struct scrub_dev *sdev = sbio->sdev;
>> +	struct page *page;
>> +	void *buffer;
>> +	int i;
>> +	u64 flags;
>> +	u64 logical;
>> +	int ret;
>> +
>> +	if (sbio->err) {
>> +		for (i = 0; i < sbio->count; ++i) {
>> +			scrub_recheck_error(sbio, i);
>> +		}
>> +		spin_lock(&sdev->stat_lock);
>> +		++sdev->stat.read_errors;
>> +		spin_unlock(&sdev->stat_lock);
>> +		goto out;
>> +	}
>> +	for (i = 0; i < sbio->count; ++i) {
>> +		page = sbio->bio->bi_io_vec[i].bv_page;
>> +		buffer = kmap_atomic(page, KM_USER0);
>> +		flags = sbio->spag[i].flags;
>> +		logical = sbio->logical + i * PAGE_SIZE;
>> +		ret = 0;
>> +		if (flags & BTRFS_EXTENT_FLAG_DATA) {
>> +			ret = scrub_checksum_data(sdev, sbio->spag + i, buffer);
>> +		} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
>> +			ret = scrub_checksum_tree_block(sdev, sbio->spag + i,
>> +			                                logical, buffer);
>> +		} else if (flags & BTRFS_EXTENT_FLAG_SUPER) {
>> +			BUG_ON(i);
>> +			(void)scrub_checksum_super(sbio, buffer);
>> +		} else {
>> +			WARN_ON(1);
>> +		}
>> +		kunmap_atomic(buffer, KM_USER0);
>> +		if (ret)
>> +			scrub_recheck_error(sbio, i);
>> +	}
>> +
>> +out:
>> +	spin_lock(&sdev->list_lock);
>> +	sbio->next_free = sdev->first_free;
>> +	sdev->first_free = sbio->index;
>> +	spin_unlock(&sdev->list_lock);
>> +	atomic_dec(&sdev->in_flight);
>> +	wake_up(&sdev->list_wait);
>> +}
>> +
>> +static int scrub_checksum_data(struct scrub_dev *sdev,
>> +                               struct scrub_page *spag, void *buffer)
>> +{
>> +	u8 csum[BTRFS_CSUM_SIZE];
>> +	u32 crc = ~(u32)0;
>> +	int fail = 0;
>> +	struct btrfs_root *root = sdev->dev->dev_root;
>> +
>> +	if (!spag->have_csum)
>> +		return 0;
>> +
>> +	crc = btrfs_csum_data(root, buffer, crc, PAGE_SIZE);
>> +	btrfs_csum_final(crc, csum);
>> +	if (memcmp(csum, spag->csum, sdev->csum_size))
>> +		fail = 1;
>> +
>> +	spin_lock(&sdev->stat_lock);
>> +	++sdev->stat.data_extents_scrubbed;
>> +	sdev->stat.data_bytes_scrubbed += PAGE_SIZE;
>> +	if (fail)
>> +		++sdev->stat.csum_errors;
>> +	spin_unlock(&sdev->stat_lock);
>> +
>> +	return fail;
>> +}
>> +
>> +static int scrub_checksum_tree_block(struct scrub_dev *sdev,
>> +                                     struct scrub_page *spag, u64 logical,
>> +                                     void *buffer)
>> +{
>> +	struct btrfs_header *h;
>> +	struct btrfs_root *root = sdev->dev->dev_root;
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	u8 csum[BTRFS_CSUM_SIZE];
>> +	u32 crc = ~(u32)0;
>> +	int fail = 0;
>> +	int crc_fail = 0;
>> +
>> +	/*
>> +	 * we don't use the getter functions here, as we
>> +	 * a) don't have an extent buffer and
>> +	 * b) the page is already kmapped
>> +	 */
>> +	h = (struct btrfs_header *)buffer;
>> +
>> +	if (logical != le64_to_cpu(h->bytenr))
>> +		++fail;
>> +
>> +	if (spag->generation != le64_to_cpu(h->generation))
>> +		++fail;
>> +
>> +	if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
>> +		++fail;
>> +
>> +	if (memcmp(h->chunk_tree_uuid, fs_info->chunk_tree_uuid,
>> +	           BTRFS_UUID_SIZE))
>> +		++fail;
>> +
>> +	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
>> +	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
>> +	btrfs_csum_final(crc, csum);
>> +	if (memcmp(csum, h->csum, sdev->csum_size))
>> +		++crc_fail;
>> +
>> +	spin_lock(&sdev->stat_lock);
>> +	++sdev->stat.tree_extents_scrubbed;
>> +	sdev->stat.tree_bytes_scrubbed += PAGE_SIZE;
>> +	if (crc_fail)
>> +		++sdev->stat.csum_errors;
>> +	if (fail)
>> +		++sdev->stat.verify_errors;
>> +	spin_unlock(&sdev->stat_lock);
>> +
>> +	return (fail || crc_fail);
>> +}
>> +
>> +static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer)
>> +{
>> +	struct btrfs_super_block *s;
>> +	u64 logical;
>> +	struct scrub_dev *sdev = sbio->sdev;
>> +	struct btrfs_root *root = sdev->dev->dev_root;
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	u8 csum[BTRFS_CSUM_SIZE];
>> +	u32 crc = ~(u32)0;
>> +	int fail = 0;
>> +
>> +	s = (struct btrfs_super_block *)buffer;
>> +	logical = sbio->logical;
>> +
>> +	if (logical != le64_to_cpu(s->bytenr))
>> +		++fail;
>> +
>> +	if (sbio->spag[0].generation != le64_to_cpu(s->generation))
>> +		++fail;
>> +
>> +	if (memcmp(s->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
>> +		++fail;
>> +
>> +	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
>> +	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
>> +	btrfs_csum_final(crc, csum);
>> +	if (memcmp(csum, s->csum, sbio->sdev->csum_size))
>> +		++fail;
>> +
>> +	if (fail) {
>> +		/*
>> +		 * if we find an error in a super block, we just report it.
>> +		 * They will get written with the next transaction commit
>> +		 * anyway
>> +		 */
>> +		spin_lock(&sdev->stat_lock);
>> +		++sdev->stat.super_errors;
>> +		spin_unlock(&sdev->stat_lock);
>> +	}
>> +
>> +	return fail;
>> +}
>> +
>> +static int scrub_submit(struct scrub_dev *sdev)
>> +{
>> +	struct scrub_bio *sbio;
>> +
>> +	if (sdev->curr == -1)
>> +		return 0;
>> +
>> +	sbio = sdev->bios + sdev->curr;
>> +	
>> +	sbio->bio->bi_sector = sbio->physical >> 9;
>> +	sbio->bio->bi_size = sbio->count * PAGE_SIZE;
>> +	sbio->bio->bi_next = NULL;
>> +	sbio->bio->bi_flags = 1 << BIO_UPTODATE;
>> +	sbio->bio->bi_comp_cpu = -1;
>> +	sbio->bio->bi_bdev = sdev->dev->bdev;
>> +	sdev->curr = -1;
>> +	atomic_inc(&sdev->in_flight);
>> +
>> +	submit_bio(0, sbio->bio);
>> +
>> +	return 0;
>> +}
>> +
>> +static int scrub_page(struct scrub_dev *sdev, u64 logical, u64 len,
>> +                      u64 physical, u64 flags, u64 gen, u64 mirror_num,
>> +                      u8 *csum, int force)
>> +{
>> +	struct scrub_bio *sbio;
>> +again:
>> +	/*
>> +	 * grab a fresh bio or wait for one to become available
>> +	 */
>> +	while (sdev->curr == -1) {
>> +		unsigned long flags;
>> +		spin_lock_irqsave(&sdev->list_lock, flags);
> 
> Is this called from an interrupt or why is the _irqsave variant used?

You're right, it is not needed anymore. It used to get locked directly
from the end_io callback, but now everything is deferred to workers.
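
So this can become a plain spin_lock/spin_unlock pair (sketch of the
intended change, the rest of the logic staying the same):

	spin_lock(&sdev->list_lock);
	sdev->curr = sdev->first_free;
	if (sdev->curr != -1) {
		sdev->first_free = sdev->bios[sdev->curr].next_free;
		sdev->bios[sdev->curr].next_free = -1;
		sdev->bios[sdev->curr].count = 0;
		spin_unlock(&sdev->list_lock);
	} else {
		spin_unlock(&sdev->list_lock);
		wait_event(sdev->list_wait, sdev->first_free != -1);
	}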

> 
>> +		sdev->curr = sdev->first_free;
>> +		if (sdev->curr != -1) {
>> +			sdev->first_free = sdev->bios[sdev->curr].next_free;
>> +			sdev->bios[sdev->curr].next_free = -1;
>> +			sdev->bios[sdev->curr].count = 0;
>> +			spin_unlock_irqrestore(&sdev->list_lock, flags);
>> +		} else {
>> +			spin_unlock_irqrestore(&sdev->list_lock, flags);
>> +			wait_event(sdev->list_wait, sdev->first_free != -1);
>> +		}
>> +	}
>> +	sbio = sdev->bios + sdev->curr;
>> +	if (sbio->count == 0) {
>> +		sbio->physical = physical;
>> +		sbio->logical = logical;
>> +	} else if (sbio->physical + sbio->count * PAGE_SIZE != physical) {
>> +		scrub_submit(sdev);
>> +		goto again;
>> +	}
>> +	sbio->spag[sbio->count].flags = flags;
>> +	sbio->spag[sbio->count].generation = gen;
>> +	sbio->spag[sbio->count].have_csum = 0;
>> +	sbio->spag[sbio->count].mirror_num = mirror_num;
>> +	if (csum) {
>> +		sbio->spag[sbio->count].have_csum = 1;
>> +		memcpy(sbio->spag[sbio->count].csum, csum, sdev->csum_size);
>> +	}
>> +	++sbio->count;
>> +	if (sbio->count == SCRUB_PAGES_PER_BIO || force)
>> +		scrub_submit(sdev);
>> +		
>> +	return 0;
>> +}
>> +
>> +static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
>> +                           u8 *csum)
>> +{
>> +	struct btrfs_ordered_sum *sum = NULL;
>> +	int ret = 0;
>> +	unsigned long i;
>> +	unsigned long num_sectors;
>> +	u32 sectorsize = sdev->dev->dev_root->sectorsize;
>> +
>> +	while (!list_empty(&sdev->csum_list)) {
>> +		sum = list_first_entry(&sdev->csum_list,
>> +				       struct btrfs_ordered_sum, list);
>> +		if (sum->bytenr > logical)
>> +			return 0;
>> +		if (sum->bytenr + sum->len > logical)
>> +			break;
>> +
>> +		++sdev->stat.csum_discards;
>> +		list_del(&sum->list);
>> +		kfree(sum);
>> +		sum = NULL;
>> +	}
>> +	if (!sum)
>> +		return 0;
>> +
>> +	num_sectors = sum->len / sectorsize;
>> +	for (i = 0; i < num_sectors; ++i) {
>> +		if (sum->sums[i].bytenr == logical) {
>> +			memcpy(csum, &sum->sums[i].sum, sdev->csum_size);
>> +			ret = 1;
>> +			break;
>> +		}
>> +	}
>> +	if (ret && i == num_sectors - 1) {
>> +		list_del(&sum->list);
>> +		kfree(sum);
>> +	}
>> +	return ret;
>> +}
>> +
>> +/* scrub extent tries to collect up to 64 kB for each bio */
>> +static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
>> +                        u64 physical, u64 flags, u64 gen, u64 mirror_num)
>> +{
>> +	int ret;
>> +	u8 csum[BTRFS_CSUM_SIZE];
>> +
>> +	while(len) {
>> +		u64 l = min_t(u64, len, PAGE_SIZE);
>> +		int have_csum = 0;
>> +
>> +		if (flags & BTRFS_EXTENT_FLAG_DATA) {
>> +			/* push csums to sbio */
>> +			have_csum = scrub_find_csum(sdev, logical, l, csum);
>> +			if (have_csum == 0)
>> +				++sdev->stat.no_csum;
>> +		}
>> +		ret = scrub_page(sdev, logical, l, physical, flags, gen,
>> +		                 mirror_num, have_csum ? csum : NULL, 0);
>> +		if (ret)
>> +			return ret;
>> +		len -= l;
>> +		logical += l;
>> +		physical += l;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
>> +	struct map_lookup *map, int num, u64 base, u64 length)
>> +{
>> +	struct btrfs_path *path;
>> +	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
>> +	struct btrfs_root *root = fs_info->extent_root;
>> +	struct btrfs_root *csum_root = fs_info->csum_root;
>> +	struct btrfs_extent_item *extent;
>> +	u64 flags;
>> +	int ret;
>> +	int slot;
>> +	int i;
>> +	int nstripes;
>> +	int start_stripe;
>> +	struct extent_buffer *l;
>> +	struct btrfs_key key;
>> +	u64 physical;
>> +	u64 logical;
>> +	u64 generation;
>> +	u64 mirror_num;
>> +
>> +	u64 increment = map->stripe_len;
>> +	u64 offset;
>> +
>> +	nstripes = length;
>> +	offset = 0;
>> +	do_div(nstripes, map->stripe_len);
>> +	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
>> +		offset = map->stripe_len * num;
>> +		increment = map->stripe_len * map->num_stripes;
>> +		mirror_num = 0;
>> +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
>> +		int factor = map->num_stripes / map->sub_stripes;
>> +		offset = map->stripe_len * (num / map->sub_stripes);
>> +		increment = map->stripe_len * factor;
>> +		mirror_num = num % map->sub_stripes;
>> +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
>> +		increment = map->stripe_len;
>> +		mirror_num = num % map->num_stripes;
>> +	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
>> +		increment = map->stripe_len;
>> +		mirror_num = num % map->num_stripes;
>> +	} else {
>> +		increment = map->stripe_len;
>> +		mirror_num = 0;
>> +	}
>> +
>> +	path = btrfs_alloc_path();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	path->reada = 2;
>> +	path->search_commit_root = 1;
>> +	path->skip_locking = 1;
>> +
>> +	/*
>> +	 * find all extents for each stripe and just read them to get
>> +	 * them into the page cache
>> +	 * FIXME: we can do better. build a more intelligent prefetching
>> +	 */
>> +	logical = base + offset;
>> +	physical = map->stripes[num].physical;
>> +	ret = 0;
>> +	for (i = 0; i < nstripes; ++i) {
>> +		key.objectid = logical;
>> +		key.type = BTRFS_EXTENT_ITEM_KEY;
>> +		key.offset = (u64)0;
>> +
>> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		l = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(l, &key, slot);
>> +		if (key.objectid != logical) {
>> +			ret = btrfs_previous_item(root, path, 0,
>> +			                          BTRFS_EXTENT_ITEM_KEY);
>> +			if (ret < 0)
>> +				goto out;
>> +		}
>> +
>> +		while (1) {
>> +			l = path->nodes[0];
>> +			slot = path->slots[0];
>> +			if (slot >= btrfs_header_nritems(l)) {
>> +				ret = btrfs_next_leaf(root, path);
>> +				if (ret == 0)
>> +					continue;
>> +				if (ret < 0)
>> +					goto out;
>> +
>> +				break;
>> +			}
>> +			btrfs_item_key_to_cpu(l, &key, slot);
>> +
>> +			if (key.objectid + key.offset <= logical)
>> +				goto next1;
>> +
>> +			if (key.objectid >= logical + map->stripe_len)
>> +				break;
>> +next1:
>> +			path->slots[0]++;
>> +		}
>> +		btrfs_release_path(root, path);
>> +		logical += increment;
>> +		physical += map->stripe_len;
>> +		cond_resched();
>> +	}
>> +
>> +	/*
>> +	 * collect all data csums for the stripe to avoid seeking during
>> +	 * the scrub. This might currently (crc32) end up to be about 1MB
>> +	 */
>> +	start_stripe = 0;
>> +again:
>> +	logical = base + offset + start_stripe * map->stripe_len;
>> +	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
>> +	for (i = start_stripe; i < nstripes; ++i) {
>> +		ret = btrfs_lookup_csums_range(csum_root, logical,
>> +		                               logical + map->stripe_len - 1,
>> +		                               &sdev->csum_list, 1);
>> +		if (ret)
>> +			goto out;
>> +
>> +		logical += increment;
>> +		cond_resched();
>> +	}
>> +	/*
>> +	 * now find all extents for each stripe and scrub them
>> +	 */
>> +	logical = base + offset + start_stripe * map->stripe_len;
>> +	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
>> +	ret = 0;
>> +	for (i = start_stripe; i < nstripes; ++i) {
>> +		/*
>> +		 * canceled?
>> +		 */
>> +		if (atomic_read(&fs_info->scrub_cancel_req) ||
>> +		    atomic_read(&sdev->cancel_req)) {
>> +			ret = -ECANCELED;
>> +			goto out;
>> +		}
>> +		/*
>> +		 * check to see if we have to pause
>> +		 */
>> +		if (atomic_read(&fs_info->scrub_pause_req)) {
>> +			/* push queued extents */
>> +			scrub_submit(sdev);
>> +			wait_event(sdev->list_wait,
>> +			           atomic_read(&sdev->in_flight) == 0);
>> +			atomic_inc(&fs_info->scrubs_paused);
>> +			wake_up(&fs_info->scrub_pause_wait);
>> +			mutex_lock(&fs_info->scrub_lock);
>> +			while(atomic_read(&fs_info->scrub_pause_req)) {
>> +				mutex_unlock(&fs_info->scrub_lock);
>> +				wait_event(fs_info->scrub_pause_wait,
>> +				   atomic_read(&fs_info->scrub_pause_req) == 0);
>> +				mutex_lock(&fs_info->scrub_lock);
>> +			}
>> +			atomic_dec(&fs_info->scrubs_paused);
>> +			mutex_unlock(&fs_info->scrub_lock);
>> +			wake_up(&fs_info->scrub_pause_wait);
>> +			scrub_free_csums(sdev);
>> +			goto again;
>> +		}
>> +
>> +		key.objectid = logical;
>> +		key.type = BTRFS_EXTENT_ITEM_KEY;
>> +		key.offset = (u64)0;
>> +
>> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		l = path->nodes[0];
>> +		slot = path->slots[0];
>> +		btrfs_item_key_to_cpu(l, &key, slot);
>> +		if (key.objectid != logical) {
>> +			ret = btrfs_previous_item(root, path, 0,
>> +			                          BTRFS_EXTENT_ITEM_KEY);
>> +			if (ret < 0)
>> +				goto out;
>> +		}
>> +
>> +		while (1) {
>> +			l = path->nodes[0];
>> +			slot = path->slots[0];
>> +			if (slot >= btrfs_header_nritems(l)) {
>> +				ret = btrfs_next_leaf(root, path);
>> +				if (ret == 0)
>> +					continue;
>> +				if (ret < 0)
>> +					goto out;
>> +
>> +				break;
>> +			}
>> +			btrfs_item_key_to_cpu(l, &key, slot);
>> +
>> +			if (key.objectid + key.offset <= logical)
>> +				goto next;
>> +
>> +			if (key.objectid >= logical + map->stripe_len)
>> +				break;
>> +
>> +			if (btrfs_key_type(&key) != BTRFS_EXTENT_ITEM_KEY)
>> +				goto next;
>> +
>> +			extent = btrfs_item_ptr(l, slot,
>> +			                        struct btrfs_extent_item);
>> +			flags = btrfs_extent_flags(l, extent);
>> +			generation = btrfs_extent_generation(l, extent);
>> +
>> +			if (key.objectid < logical &&
>> +			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
>> +				printk(KERN_ERR
>> +				       "btrfs scrub: tree block %lld spanning "
>> +				       "stripes, ignored. logical=%lld\n",
>> +				       key.objectid, logical);
>> +				goto next;
>> +			}
>> +
>> +			/*
>> +			 * trim extent to this stripe
>> +			 */
>> +			if (key.objectid < logical) {
>> +				key.offset -= logical - key.objectid;
>> +				key.objectid = logical;
>> +			}
>> +			if (key.objectid + key.offset >
>> +			    logical + map->stripe_len) {
>> +				key.offset = logical + map->stripe_len -
>> +				             key.objectid;
>> +			}
>> +
>> +			ret = scrub_extent(sdev, key.objectid, key.offset,
>> +			                   key.objectid - logical + physical,
>> +			                   flags, generation, mirror_num);
>> +			if (ret)
>> +				goto out;
>> +next:
>> +			path->slots[0]++;
>> +		}
>> +		btrfs_release_path(root, path);
>> +		logical += increment;
>> +		physical += map->stripe_len;
>> +		spin_lock(&sdev->stat_lock);
>> +		sdev->stat.last_physical = physical;
>> +		spin_unlock(&sdev->stat_lock);
>> +	}
>> +	/* push queued extents */
>> +	scrub_submit(sdev);
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret < 0 ? ret : 0;
>> +}
>> +
>> +static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev, 
>> +	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length)
>> +{
>> +	struct btrfs_mapping_tree *map_tree =
>> +		&sdev->dev->dev_root->fs_info->mapping_tree;
>> +	struct map_lookup *map;
>> +	struct extent_map *em;
>> +	int i;
>> +	int ret;
>> +
>> +	read_lock(&map_tree->map_tree.lock);
>> +	em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
>> +	read_unlock(&map_tree->map_tree.lock);
>> +
>> +	if (!em)
>> +		return -EINVAL;
>> +
>> +	map = (struct map_lookup *)em->bdev;
>> +	if (em->start != chunk_offset)
>> +		return -EINVAL;
>> +
>> +	if (em->len < length)
>> +		return -EINVAL;
>> +
>> +	for (i = 0; i < map->num_stripes; ++i) {
>> +		if (map->stripes[i].dev == sdev->dev) {
>> +			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	return 0;
>> +}
>> +
>> +static noinline_for_stack
>> +int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
>> +{
>> +	struct btrfs_dev_extent *dev_extent = NULL;
>> +	struct btrfs_path *path;
>> +	struct btrfs_root *root = sdev->dev->dev_root;
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	u64 length;
>> +	u64 chunk_tree;
>> +	u64 chunk_objectid;
>> +	u64 chunk_offset;
>> +	int ret;
>> +	int slot;
>> +	struct extent_buffer *l;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_block_group_cache *cache;
>> +
>> +	path = btrfs_alloc_path();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	path->reada = 2;
>> +	path->search_commit_root = 1;
>> +	path->skip_locking = 1;
>> +
>> +	key.objectid = sdev->dev->devid;
>> +	key.offset = 0ull;
>> +	key.type = BTRFS_DEV_EXTENT_KEY;
>> +
>> +
>> +	while (1) {
>> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +		if (ret < 0)
>> +			goto out;
>> +		ret = 0;
>> +
>> +		l = path->nodes[0];
>> +		slot = path->slots[0];
>> +
>> +		btrfs_item_key_to_cpu(l, &found_key, slot);
>> +
>> +		if (found_key.objectid != sdev->dev->devid)
>> +			break;
>> +
>> +		if (btrfs_key_type(&key) != BTRFS_DEV_EXTENT_KEY)
>> +			break;
>> +
>> +		if (found_key.offset >= end)
>> +			break;
>> +
>> +		if (found_key.offset < key.offset)
>> +			break;
>> +
>> +		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
>> +		length = btrfs_dev_extent_length(l, dev_extent);
>> +
>> +		if (found_key.offset + length <= start) {
>> +			key.offset = found_key.offset + length;
>> +			btrfs_release_path(root, path);
>> +			continue;
>> +		}
>> +
>> +		chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
>> +		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
>> +		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
>> +
>> +		/*
>> +		 * get a reference on the corresponding block group to prevent
>> +		 * the chunk from going away while we scrub it
>> +		 */
>> +		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
>> +		if (!cache) {
>> +			ret = -ENOENT;
>> +			goto out;
>> +		}
>> +		ret = scrub_chunk(sdev, chunk_tree, chunk_objectid,
>> +		                  chunk_offset, length);
>> +		btrfs_put_block_group(cache);
>> +		if (ret)
>> +			break;
>> +
>> +		key.offset = found_key.offset + length;
>> +		btrfs_release_path(root, path);
>> +	}
>> +
>> +out:
>> +	btrfs_free_path(path);
>> +	return ret;
>> +}
>> +
>> +static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
>> +{
>> +	int	i;
>> +	u64	bytenr;
>> +	u64	gen;
>> +	int	ret;
>> +	struct btrfs_device *device = sdev->dev;
>> +	struct btrfs_root *root = device->dev_root;
>> +
>> +	gen = root->fs_info->last_trans_committed;
>> +
>> +	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>> +		bytenr = btrfs_sb_offset(i);
>> +		if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->total_bytes)
>> +			break;
>> +
>> +		ret = scrub_page(sdev, bytenr, PAGE_SIZE, bytenr, 
>> +		                 BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * get a reference count on fs_info->scrub_workers. start worker if necessary
>> + */
>> +static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	if (fs_info->scrub_workers_refcnt == 0) {
>> +#ifdef SCRUB_BTRFS_WORKER
>> +		btrfs_start_workers(&fs_info->scrub_workers, 1);
>> +#else
>> +		fs_info->scrub_workers = create_workqueue("scrub");
>> +		if (!fs_info->scrub_workers) {
>> +			mutex_unlock(&fs_info->scrub_lock);
>> +			return -ENOMEM;
>> +		}
>> +#endif
>> +	}
>> +	++fs_info->scrub_workers_refcnt;
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +
>> +	return 0;
>> +}
>> +
>> +static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
> 
> This func is always called immediately after a mutex_unlock(scrub_lock),
> and then takes the lock again. I suggest to drop locking here and adjust
> all callsites.
> 

This only holds for 2 out of 4 calls. I don't know if it's
worth it, as this is only a very low-frequency path.
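
If it were worth it, the usual way would be a lock-free helper plus a
small wrapper (sketch only, names hypothetical, ifdef'ery omitted):

	/* caller must hold scrub_lock */
	static void __scrub_workers_put(struct btrfs_fs_info *fs_info)
	{
		if (--fs_info->scrub_workers_refcnt == 0)
			btrfs_stop_workers(&fs_info->scrub_workers);
		WARN_ON(fs_info->scrub_workers_refcnt < 0);
	}

	static void scrub_workers_put(struct btrfs_root *root)
	{
		struct btrfs_fs_info *fs_info = root->fs_info;

		mutex_lock(&fs_info->scrub_lock);
		__scrub_workers_put(fs_info);
		mutex_unlock(&fs_info->scrub_lock);
	}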

> Same applies for scrub_workers_get()
> 
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	if (--fs_info->scrub_workers_refcnt == 0) {
>> +#ifdef SCRUB_BTRFS_WORKER
>> +		btrfs_stop_workers(&fs_info->scrub_workers);
>> +#else
>> +		destroy_workqueue(fs_info->scrub_workers);
>> +		fs_info->scrub_workers = NULL;
>> +#endif
>> +
>> +	}
>> +	WARN_ON(fs_info->scrub_workers_refcnt < 0);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +}
>> +
>> +
>> +int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
>> +                    struct btrfs_scrub_progress *progress)
>> +{
>> +	struct scrub_dev *sdev;
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	int ret;
>> +	struct btrfs_device *dev;
>> +
>> +	if (root->fs_info->closing)
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * check some assumptions
>> +	 */
>> +	if (root->sectorsize != PAGE_SIZE ||
>> +	    root->sectorsize != root->leafsize ||
>> +	    root->sectorsize != root->nodesize) {
>> +		printk(KERN_ERR "btrfs_scrub: size assumptions fail\n");
>> +		return -EINVAL;
>> +	}
>> +	    
>> +	ret = scrub_workers_get(root);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>> +	dev = btrfs_find_device(root, devid, NULL, NULL);
>> +	if (!dev || dev->missing) {
>> +		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>> +		scrub_workers_put(root);
>> +		return -ENODEV;
>> +	}
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	if (dev->scrub_device) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		scrub_workers_put(root);
>> +		return -EINPROGRESS;
>> +	}
>> +	sdev = scrub_setup_dev(dev);
>> +	if (IS_ERR(sdev)) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		scrub_workers_put(root);
>> +		return PTR_ERR(sdev);
>> +	}
>> +	dev->scrub_device = sdev;
>> +
>> +	atomic_inc(&fs_info->scrubs_running);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>> +
>> +	down_read(&fs_info->scrub_super_lock);
>> +	ret = scrub_supers(sdev);
>> +	up_read(&fs_info->scrub_super_lock);
>> +
>> +	if (!ret)
>> +		ret = scrub_enumerate_chunks(sdev, start, end);
>> +
>> +	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	atomic_dec(&fs_info->scrubs_running);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +	wake_up(&fs_info->scrub_pause_wait);
>> +
>> +	if (progress)
>> +		memcpy(progress, &sdev->stat, sizeof(*progress));
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	dev->scrub_device = NULL;
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +
>> +	scrub_free_dev(sdev);
>> +	scrub_workers_put(root);
>> +
>> +	return ret;
>> +}
>> +
>> +int btrfs_scrub_pause(struct btrfs_root *root)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	atomic_inc(&fs_info->scrub_pause_req);
>> +	while (atomic_read(&fs_info->scrubs_paused) !=
>> +	       atomic_read(&fs_info->scrubs_running)) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		wait_event(fs_info->scrub_pause_wait,
>> +			   atomic_read(&fs_info->scrubs_paused) ==
>> +			   atomic_read(&fs_info->scrubs_running));
>> +		mutex_lock(&fs_info->scrub_lock);
>> +	}
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +
>> +	return 0;
>> +}
>> +
>> +int btrfs_scrub_continue(struct btrfs_root *root)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +
>> +	atomic_dec(&fs_info->scrub_pause_req);
>> +	wake_up(&fs_info->scrub_pause_wait);
>> +	return 0;
>> +}
>> +
>> +int btrfs_scrub_pause_super(struct btrfs_root *root)
>> +{
>> +	down_write(&root->fs_info->scrub_super_lock);
>> +	return 0;
>> +}
>> +
>> +int btrfs_scrub_continue_super(struct btrfs_root *root)
>> +{
>> +	up_write(&root->fs_info->scrub_super_lock);
>> +	return 0;
>> +}
>> +
>> +int btrfs_scrub_cancel(struct btrfs_root *root)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	if (!atomic_read(&fs_info->scrubs_running)) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		return -ENOTCONN;
>> +	}
>> +
>> +	atomic_inc(&fs_info->scrub_cancel_req);
>> +	while(atomic_read(&fs_info->scrubs_running)) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		wait_event(fs_info->scrub_pause_wait,
>> +			   atomic_read(&fs_info->scrubs_running) == 0);
>> +		mutex_lock(&fs_info->scrub_lock);
>> +	}
>> +	atomic_dec(&fs_info->scrub_cancel_req);
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +	
>> +	return 0;
>> +}
>> +
>> +int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	struct scrub_dev *sdev;
>> +
>> +	mutex_lock(&fs_info->scrub_lock);
>> +	sdev = dev->scrub_device;
>> +	if (!sdev) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		return -ENOTCONN;
>> +	}
>> +	atomic_inc(&sdev->cancel_req);
>> +	while(dev->scrub_device) {
>> +		mutex_unlock(&fs_info->scrub_lock);
>> +		wait_event(fs_info->scrub_pause_wait,
>> +		           dev->scrub_device == NULL);
>> +		mutex_lock(&fs_info->scrub_lock);
>> +	}
>> +	mutex_unlock(&fs_info->scrub_lock);
>> +		
>> +	return 0;
>> +}
>> +int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid)
>> +{
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	struct btrfs_device *dev;
>> +	int ret;
>> +
>> +	/*
>> +	 * we have to hold the device_list_mutex here so the device
>> +	 * does not go away in cancel_dev. FIXME: find a better solution
>> +	 */
>> +	mutex_lock(&fs_info->fs_devices->device_list_mutex);
>> +	dev = btrfs_find_device(root, devid, NULL, NULL);
>> +	if (!dev) {
>> +		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>> +		return -ENODEV;
>> +	}
>> +	ret = btrfs_scrub_cancel_dev(root, dev);
>> +	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>> +
>> +	return ret;
>> +}
>> +	
>> +int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
>> +                         struct btrfs_scrub_progress *progress)
>> +{
>> +	struct btrfs_device *dev;
>> +	struct scrub_dev *sdev = NULL;
>> +
>> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>> +	dev = btrfs_find_device(root, devid, NULL, NULL);
>> +	if (dev)
>> +		sdev = dev->scrub_device;
>> +	if (sdev)
>> +		memcpy(progress, &sdev->stat, sizeof(*progress));
>> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>> +
>> +	return dev ? (sdev ? 0 : -ENOTCONN) : -ENODEV;
>> +}
>> -- 
>> 1.7.3.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/6] btrfs: add state information for scrub
  2011-03-11 16:53   ` David Sterba
@ 2011-03-12 13:13     ` Arne Jansen
  0 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-12 13:13 UTC (permalink / raw)
  To: linux-btrfs, dave

David Sterba wrote:
> On Fri, Mar 11, 2011 at 03:49:42PM +0100, Arne Jansen wrote:
>> Add structures and state information needed for scrub
>>
>> Signed-off-by: Arne Jansen <sensille@gmx.net>
>> ---
>>  fs/btrfs/ctree.h   |   26 ++++++++++++++++++++++++++
>>  fs/btrfs/disk-io.c |   15 +++++++++++++++
>>  fs/btrfs/ioctl.h   |   17 +++++++++++++++++
>>  fs/btrfs/volumes.h |    3 +++
>>  4 files changed, 61 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 030c321..3584179 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -23,6 +23,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/highmem.h>
>>  #include <linux/fs.h>
>> +#include <linux/rwsem.h>
>>  #include <linux/completion.h>
>>  #include <linux/backing-dev.h>
>>  #include <linux/wait.h>
>> @@ -32,6 +33,7 @@
>>  #include "extent_io.h"
>>  #include "extent_map.h"
>>  #include "async-thread.h"
>> +#include "ioctl.h"
>>  
>>  struct btrfs_trans_handle;
>>  struct btrfs_transaction;
>> @@ -48,6 +50,8 @@ struct btrfs_ordered_sum;
>>  
>>  #define BTRFS_COMPAT_EXTENT_TREE_V0
>>  
>> +#define SCRUB_BTRFS_WORKER
>> +
>>  /*
>>   * files bigger than this get some pre-flushing when they are added
>>   * to the ordered operations list.  That way we limit the total
>> @@ -508,6 +512,12 @@ struct btrfs_extent_item_v0 {
>>  /* use full backrefs for extent pointers in the block */
>>  #define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)
>>  
>> +/*
>> + * this flag is only used internally by scrub and may be changed at any time
>> + * it is only declared here to avoid collisions
>> + */
>> +#define BTRFS_EXTENT_FLAG_SUPER		(1ULL << 48)
>> +
>>  struct btrfs_tree_block_info {
>>  	struct btrfs_disk_key key;
>>  	u8 level;
>> @@ -1067,6 +1077,22 @@ struct btrfs_fs_info {
>>  
>>  	void *bdev_holder;
>>  
>> +	/* private scrub information */
>> +	struct mutex scrub_lock;
>> +	struct scrub_info *scrub_info;
>                            ^^^^^^^^^^
> 
> I did not find any reference to this item

right, thanks.

> 
>> +	atomic_t scrubs_running;
>> +	atomic_t scrub_pause_req;
>> +	atomic_t scrubs_paused;
>> +	atomic_t scrub_cancel_req;
> 
> This make me think ... you declare atomics and yet lock (nearly) every
> variable use like 
> 
> +       mutex_lock(&fs_info->scrub_lock);
> +       atomic_inc(&fs_info->scrubs_running);
> +       mutex_unlock(&fs_info->scrub_lock);
> 
> or
> 
> +       mutex_lock(&fs_info->scrub_lock);
> +       if (!atomic_read(&fs_info->scrubs_running)) {
> +               mutex_unlock(&fs_info->scrub_lock);
> +               return -ENOTCONN;
> +       }
> 
> imho this is not needed with atomics. Moreover, the locking is not
> consistent, quick grep for atomic_read shows many statements without any
> locks around.

Ok, let's look at them one at a time.
scrubs_running is always accessed with mutex scrub_lock held, except
one time. This one time is when it is being used as a condition to
wait_event. The problem here is that I don't know how to protect the
condition. What I'm missing here are condition variables. So for this
one case, I made it atomic. The question is whether I can omit the lock
around the other uses. I'm not sure about that, because some code may
assume that the value doesn't change while it is holding the lock.
I'll read it carefully in this regard again.
scrub_pause_req is used mostly without locks, in performance-critical
places, so it should stay atomic.
scrubs_paused is also used without a lock several times.
Same for scrub_cancel_req; it's also performance-critical.
scrub_workers_refcnt is always used inside a lock. I don't think an
atomic would do, as the first increment can happen in several threads
simultaneously, and the second thread already expects the data structures
to be set up.

So mainly I'll see if I can omit the locking around the uses of
scrubs_running.
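
For the wait_event case the constraint is simply that the condition is
evaluated without scrub_lock held, as in btrfs_scrub_cancel():

	wait_event(fs_info->scrub_pause_wait,
		   atomic_read(&fs_info->scrubs_running) == 0);

so the counter itself has to be safe to read lock-free, while the
increments and decrements can still happen under the mutex where
ordering against other scrub state matters.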

> 
> 
>> +	wait_queue_head_t scrub_pause_wait;
>> +	struct rw_semaphore scrub_super_lock;
>> +	int scrub_workers_refcnt;
> 
> A refcount could be an atomic too ...
> 
>> +#ifdef SCRUB_BTRFS_WORKER
>> +	struct btrfs_workers scrub_workers;
>> +#else
>> +	struct workqueue_struct *scrub_workers;
>> +#endif
>> +
> 
> Apart from the atomics and scrub_workers_refcnt, there is only
> scrub_workers left that needs locking protection, which can be done
> under a spinlock.

I wouldn't call create_workqueue under the protection of a spinlock.
Of course I could call it upfront and free it if scrub_workers
already exists, but that isn't worth the effort, as this path is
only used at the start and end of a scrub.
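
Roughly, that upfront variant would look like this (just a sketch of the
idea, assuming scrub_lock were a spinlock; not code I'm proposing):

	struct workqueue_struct *wq = create_workqueue("scrub");

	if (!wq)
		return -ENOMEM;

	spin_lock(&fs_info->scrub_lock);
	if (fs_info->scrub_workers_refcnt++ == 0) {
		/* first caller, keep the freshly created workqueue */
		fs_info->scrub_workers = wq;
		wq = NULL;
	}
	spin_unlock(&fs_info->scrub_lock);

	if (wq)
		destroy_workqueue(wq);	/* already set up, discard ours */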

Again, thanks for taking the time to review this.

Arne

> 
> 
> dave
> 
>>  	/* filesystem state */
>>  	u64 fs_state;
>>  };
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 924a366..4d62bc3 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1677,6 +1677,21 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>>  	INIT_LIST_HEAD(&fs_info->ordered_extents);
>>  	spin_lock_init(&fs_info->ordered_extent_lock);
>>  
>> +	mutex_init(&fs_info->scrub_lock);
>> +	atomic_set(&fs_info->scrubs_running, 0);
>> +	atomic_set(&fs_info->scrub_pause_req, 0);
>> +	atomic_set(&fs_info->scrubs_paused, 0);
>> +	atomic_set(&fs_info->scrub_cancel_req, 0);
>> +	init_waitqueue_head(&fs_info->scrub_pause_wait);
>> +	init_rwsem(&fs_info->scrub_super_lock);
>> +	fs_info->scrub_workers_refcnt = 0;
>> +#ifdef SCRUB_BTRFS_WORKER
>> +	btrfs_init_workers(&fs_info->scrub_workers, "scrub",
>> +			   fs_info->thread_pool_size, &fs_info->generic_worker);
>> +#else
>> +	fs_info->scrub_workers = NULL;
>> +#endif
>> +
>>  	sb->s_blocksize = 4096;
>>  	sb->s_blocksize_bits = blksize_bits(4096);
>>  	sb->s_bdi = &fs_info->bdi;
>> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
>> index 8fb3821..973e7c8 100644
>> --- a/fs/btrfs/ioctl.h
>> +++ b/fs/btrfs/ioctl.h
>> @@ -42,6 +42,23 @@ struct btrfs_ioctl_vol_args_v2 {
>>  	char name[BTRFS_SUBVOL_NAME_MAX + 1];
>>  };
>>  
>> +struct btrfs_scrub_progress {
>> +	__u64 data_extents_scrubbed;
>> +	__u64 tree_extents_scrubbed;
>> +	__u64 data_bytes_scrubbed;
>> +	__u64 tree_bytes_scrubbed;
>> +	__u64 read_errors;
>> +	__u64 csum_errors;
>> +	__u64 verify_errors;
>> +	__u64 no_csum;
>> +	__u64 csum_discards;
>> +	__u64 super_errors;
>> +	__u64 malloc_errors;
>> +	__u64 uncorrectable_errors;
>> +	__u64 corrected_errors;
>> +	__u64 last_physical;
>> +};
>> +
>>  #define BTRFS_INO_LOOKUP_PATH_MAX 4080
>>  struct btrfs_ioctl_ino_lookup_args {
>>  	__u64 treeid;
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 0ccc982..92204d9 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -86,6 +86,9 @@ struct btrfs_device {
>>  	/* physical drive uuid (or lvm uuid) */
>>  	u8 uuid[BTRFS_UUID_SIZE];
>>  
>> +	/* per-device scrub information */
>> +	struct scrub_dev *scrub_device;
>> +
>>  	struct btrfs_work work;
>>  };
>>  
>> -- 
>> 1.7.3.4
>>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/6] btrfs: scrub
  2011-03-11 16:17 ` [PATCH v2 0/6] btrfs: scrub Ric Wheeler
@ 2011-03-12 13:20   ` Arne Jansen
  0 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-12 13:20 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: chris.mason, linux-btrfs, jansen

Ric Wheeler wrote:
> On 03/11/2011 09:49 AM, Arne Jansen wrote:
> 
> Great work!
> 
> I do wonder if we should also worry about the unallocated part of the 
> storage device. Often, with local disks specifically, your unused space 
> might accumulate errors over time.
> 
> What you add to your scrub phase is a simple read operation for the 
> unallocated ranges (optionally a READ_VERIFY which validates the data on 
> platter without transferring data over the bus to the host).
> 
> The recovery operation here would be to write (zeros) to the block if an 
> error is detected, so we might be pessimistic and simply use write to 
> "zero" those unallocated ranges as well.  Note that there are 
> "WRITE_SAME" commands that RAID people use for initializing unused 
> drives for example.
> 
> I would not run the overwrite or read check on SSD or arrays so this 
> would be an optional type of scrub I suppose.

Thanks for your suggestion. I'll definitely add scrubbing unused space in
a later revision and make it optional.
The read/verify part is quite easy (as long as I can manage to get a
READ_VERIFY through the stack). Rewriting detected errors is harder, as I
have to make sure the range doesn't get allocated in the meantime. Currently
scrub is working only on the latest commit and has no knowledge of the
running transaction except that it is synchronized with it to switch to
the next commit.

Arne

> 
> Regards,
> 
> Ric
> 
>> Arne Jansen (5):
>>    btrfs: add parameter to btrfs_lookup_csum_range
>>    btrfs: make struct map_lookup public
>>    btrfs: add scrub code and prototypes
>>    btrfs: sync scrub with commit&  device removal
>>    btrfs: add state information for scrub
>>
>> Jan Schmidt (1):
>>    btrfs: new ioctls for scrub
>>
>>   fs/btrfs/Makefile      |    2 +-
>>   fs/btrfs/ctree.h       |   46 ++-
>>   fs/btrfs/disk-io.c     |   16 +
>>   fs/btrfs/file-item.c   |    8 +-
>>   fs/btrfs/inode.c       |    2 +-
>>   fs/btrfs/ioctl.c       |  131 +++++
>>   fs/btrfs/ioctl.h       |   55 ++
>>   fs/btrfs/relocation.c  |    2 +-
>>   fs/btrfs/scrub.c       | 1463 
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/transaction.c |    3 +
>>   fs/btrfs/tree-log.c    |    6 +-
>>   fs/btrfs/volumes.c     |   16 +-
>>   fs/btrfs/volumes.h     |   17 +
>>   13 files changed, 1743 insertions(+), 24 deletions(-)
>>   create mode 100644 fs/btrfs/scrub.c
>>
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes
  2011-03-12 10:54     ` Arne Jansen
@ 2011-03-22 16:38       ` David Sterba
  2011-03-23 14:19         ` Arne Jansen
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2011-03-22 16:38 UTC (permalink / raw)
  To: Arne Jansen; +Cc: linux-btrfs, dave

Hi,

sorry for the late reply to this. (I will have another look at your git
series.)

> David Sterba wrote:
> >On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
> >>This is the main scrub code.
> >>
> >>Signed-off-by: Arne Jansen <sensille@gmx.net>
> >>---
> >> fs/btrfs/Makefile |    2 +-
> >> fs/btrfs/ctree.h  |   14 +
> >> fs/btrfs/scrub.c  | 1463 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 1478 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> >>index 31610ea..8fda313 100644
> >>--- a/fs/btrfs/Makefile
> >>+++ b/fs/btrfs/Makefile
> >>@@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> >> 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
> >> 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
> >> 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
> >>-	   compression.o delayed-ref.o relocation.o
> >>+	   compression.o delayed-ref.o relocation.o scrub.o
> >>diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>index 4c99834..030c321 100644
> >>--- a/fs/btrfs/ctree.h
> >>+++ b/fs/btrfs/ctree.h
> >>@@ -2610,4 +2610,18 @@ void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
> >> 			      u64 *bytes_to_reserve);
> >> void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
> >> 			      struct btrfs_pending_snapshot *pending);
> >>+
> >>+/* scrub.c */
> >>+int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
> >>+                    struct btrfs_scrub_progress *progress);
> >>+int btrfs_scrub_pause(struct btrfs_root *root);
> >>+int btrfs_scrub_pause_super(struct btrfs_root *root);
> >>+int btrfs_scrub_continue(struct btrfs_root *root);
> >>+int btrfs_scrub_continue_super(struct btrfs_root *root);
> >>+int btrfs_scrub_cancel(struct btrfs_root *root);
> >>+int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
> >>+int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
> >>+int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
> >>+                         struct btrfs_scrub_progress *progress);
> >>+
> >> #endif
> >>diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >>new file mode 100644
> >>index 0000000..d606f4d
> >>--- /dev/null
> >>+++ b/fs/btrfs/scrub.c
> >>@@ -0,0 +1,1463 @@
> >>+/*
> >>+ * Copyright (C) 2011 STRATO.  All rights reserved.
> >>+ *
> >>+ * This program is free software; you can redistribute it and/or
> >>+ * modify it under the terms of the GNU General Public
> >>+ * License v2 as published by the Free Software Foundation.
> >>+ *
> >>+ * This program is distributed in the hope that it will be useful,
> >>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>+ * General Public License for more details.
> >>+ *
> >>+ * You should have received a copy of the GNU General Public
> >>+ * License along with this program; if not, write to the
> >>+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> >>+ * Boston, MA 021110-1307, USA.
> >>+ */
> >>+
> >>+#include <linux/sched.h>
> >>+#include <linux/pagemap.h>
> >>+#include <linux/writeback.h>
> >>+#include <linux/blkdev.h>
> >>+#include <linux/rbtree.h>
> >>+#include <linux/slab.h>
> >>+#include <linux/workqueue.h>
> >>+#include "ctree.h"
> >>+#include "volumes.h"
> >>+#include "disk-io.h"
> >>+#include "ordered-data.h"
> >>+
> >>+/*
> >>+ * This is only the first step towards a full-features scrub. It reads all
> >>+ * extent and super block and verifies the checksums. In case a bad checksum
> >>+ * is found or the extent cannot be read, good data will be written back if
> >>+ * any can be found.
> >>+ *
> >>+ * Future enhancements:
> >>+ *  - To enhance the performance, better read-ahead strategies for the
> >>+ *    extent-tree can be employed.
> >>+ *  - In case an unrepairable extent is encountered, track which files are
> >>+ *    affected and report them
> >>+ *  - In case of a read error on files with nodatasum, map the file and read
> >>+ *    the extent to trigger a writeback of the good copy
> >>+ *  - track and record media errors, throw out bad devices
> >>+ *  - add a readonly mode
> >>+ *  - add a mode to also read unallocated space
> >>+ */
> >>+
> >>+#ifdef SCRUB_BTRFS_WORKER
> >>+typedef struct btrfs_work scrub_work_t;
> >>+#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
> >>+#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker(&(wq), w); } while (0)
> >>+#else
> >>+typedef struct work_struct scrub_work_t;
> >>+#define SCRUB_INIT_WORK INIT_WORK
> >>+#define SCRUB_QUEUE_WORK queue_work
> >>+#endif
> >>+
> >>+struct scrub_bio;
> >>+struct scrub_page;
> >>+struct scrub_dev;
> >>+struct scrub_fixup;
> >>+static void scrub_bio_end_io(struct bio *bio, int err);
> >>+static void scrub_checksum(scrub_work_t *work);
> >>+static int scrub_checksum_data(struct scrub_dev *sdev,
> >>+                               struct scrub_page *spag, void *buffer);
> >>+static int scrub_checksum_tree_block(struct scrub_dev *sdev,
> >>+                                     struct scrub_page *spag, u64 logical,
> >>+                                     void *buffer);
> >>+static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
> >>+static void scrub_recheck_end_io(struct bio *bio, int err);
> >>+static void scrub_fixup_worker(scrub_work_t *work);
> >>+static void scrub_fixup(struct scrub_fixup *fixup);
> >>+
> >>+#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
> >>+#define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
> >>+
> >>+struct scrub_page {
> >>+	u64			flags;  /* extent flags */
> >>+	u64			generation;
> >>+	u64			mirror_num;
> >>+	int			have_csum;
> >>+	u8			csum[BTRFS_CSUM_SIZE];
> >>+};
> >>+
> >>+struct scrub_bio {
> >>+	int			index;
> >>+	struct scrub_dev	*sdev;
> >>+	struct bio		*bio;
> >>+	int			err;
> >>+	u64			logical;
> >>+	u64			physical;
> >>+	struct scrub_page	spag[SCRUB_PAGES_PER_BIO];
> >>+	u64			count;
> >>+	int			next_free;
> >>+	scrub_work_t		work;
> >>+};
> >>+
> >>+struct scrub_dev {
> >>+	struct scrub_bio	bios[SCRUB_BIOS_PER_DEV];
> >
> >sizeof(struct scrub_bio) == 1160
> >SCRUB_BIOS_PER_DEV == 16
> >
> >>+	struct btrfs_device	*dev;
> >>+	int			first_free;
> >>+	int			curr;
> >>+	atomic_t		in_flight;
> >>+	spinlock_t		list_lock;
> >>+	wait_queue_head_t	list_wait;
> >>+	u16			csum_size;
> >>+	struct list_head	csum_list;
> >>+	atomic_t		cancel_req;
> >>+	/*
> >>+	 * statistics
> >>+	 */
> >>+	struct btrfs_scrub_progress stat;
> >>+	spinlock_t		stat_lock;
> >>+};
> >
> >sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3 allocation in
> >scrub_setup_dev()
> 
> Is this a problem? There are only a few allocations of it, one per device.

High-order allocations may fail when memory is fragmented and should be
avoided when possible. (And they can be avoided here: allocate each
'struct scrub_bio' separately and fill the bios array with pointers.)
Otherwise the scrub ioctl may fail to start until an order-3 allocation
becomes available.
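
Concretely, that means turning the embedded array into an array of
pointers and doing one order-0 allocation per bio context, along these
lines (a sketch of the suggestion, not tested):

	struct scrub_dev {
		struct scrub_bio	*bios[SCRUB_BIOS_PER_DEV];
		/* ... remaining members unchanged ... */
	};

	/* in scrub_setup_dev() */
	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
		sdev->bios[i] = kzalloc(sizeof(struct scrub_bio), GFP_NOFS);
		if (!sdev->bios[i])
			goto nomem;
	}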

> >>+
> >>+struct scrub_fixup {
> >>+	struct scrub_dev	*sdev;
> >>+	struct bio		*bio;
> >>+	u64			logical;
> >>+	u64			physical;
> >>+	struct scrub_page	spag;
> >>+	scrub_work_t		work;
> >>+	int			err;
> >>+	int			recheck;
> >>+};
> >>+
> >>+static void scrub_free_csums(struct scrub_dev *sdev)
> >>+{
> >>+	while(!list_empty(&sdev->csum_list)) {
> >>+		struct btrfs_ordered_sum *sum;
> >>+		sum = list_first_entry(&sdev->csum_list,
> >>+		                       struct btrfs_ordered_sum, list);
> >>+		list_del(&sum->list);
> >>+		kfree(sum);
> >>+	}
> >>+}
> >>+
> >>+static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
> >>+{
> >>+	int i;
> >>+	int j;
> >>+	struct page *last_page;
> >>+
> >>+	if (!sdev)
> >>+		return;
> >>+
> >>+	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
> >>+		struct bio *bio = sdev->bios[i].bio;
> >>+		if (bio)
> >                   ^^^^^
> >stop when we found something to free?
> >
> 
> right, good catch. It's obviously the wrong way round.
> 
> >
> >>+			break;
> >>+		
> >>+		last_page = NULL;
> >>+		for (j = 0; j < bio->bi_vcnt; ++j) {
> >                                ^^^
> >and dereference it.
> >
> >>+			if (bio->bi_io_vec[i].bv_page == last_page)
> >>+				continue;
> >>+			last_page = bio->bi_io_vec[i].bv_page;
> >>+			__free_page(last_page);
> >>+		}
> >>+		bio_put(sdev->bios[i].bio);
> >>+	}
> >>+
> >>+	scrub_free_csums(sdev);
> >>+	kfree(sdev);
> >>+}
> >>+
> >>+static noinline_for_stack
> >>+struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
> >>+{
> >>+	struct scrub_dev *sdev;
> >>+	int		i;
> >>+	int		j;
> >>+	int		ret;
> >>+	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
> >
> >(coding style expects a newline here)
> 
> coding style issues are always the gravest. Hope you never catch me with
> a line > 80 columns ;)

No, I will not, my monitor is wide enough :) but separating declarations
from code is IMO useful: I can clearly see what variables will be used and
then how, with a visual separator. It really helps reviewing, the code has
a smoother flow :)

> 
> >
> >>+	sdev = kzalloc(sizeof(*sdev), GFP_NOFS);
> >>+	if (!sdev)
> >>+		goto nomem;
> >>+	sdev->dev = dev;
> >>+	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
> >>+		struct bio *bio;
> >>+
> >>+		bio = bio_alloc(GFP_NOFS, SCRUB_PAGES_PER_BIO);
> >>+		if (!bio)
> >>+			goto nomem;
> >>+
> >>+		sdev->bios[i].index = i;
> >>+		sdev->bios[i].sdev = sdev;
> >>+		sdev->bios[i].bio = bio;
> >>+		sdev->bios[i].count = 0;
> >>+		SCRUB_INIT_WORK(&sdev->bios[i].work, scrub_checksum);
> >>+		bio->bi_private = sdev->bios + i;
> >>+		bio->bi_end_io = scrub_bio_end_io;
> >>+		bio->bi_sector = 0;
> >>+		bio->bi_bdev = dev->bdev;
> >>+		bio->bi_size = 0;
> >>+
> >>+		for (j = 0; j < SCRUB_PAGES_PER_BIO; ++j) {
> >>+			struct page *page;
> >>+			page = alloc_page(GFP_NOFS);
> >>+			if (!page)
> >>+				goto nomem;
> >>+
> >>+			ret = bio_add_page(bio, page, PAGE_SIZE, 0);
> >>+			if (!ret)
> >>+				goto nomem;
> >>+		}
> >>+		WARN_ON(bio->bi_vcnt != SCRUB_PAGES_PER_BIO);
> >>+
> >>+		if (i != SCRUB_BIOS_PER_DEV-1)
> >>+			sdev->bios[i].next_free = i + 1;
> >>+		 else
> >>+			sdev->bios[i].next_free = -1;
> >>+	}
> >>+	sdev->first_free = 0;
> >>+	sdev->curr = -1;
> >>+	atomic_set(&sdev->in_flight, 0);
> >>+	atomic_set(&sdev->cancel_req, 0);
> >>+	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
> >>+	INIT_LIST_HEAD(&sdev->csum_list);
> >>+	
> >>+	spin_lock_init(&sdev->list_lock);
> >>+	spin_lock_init(&sdev->stat_lock);
> >>+	init_waitqueue_head(&sdev->list_wait);
> >>+	return sdev;
> >>+
> >>+nomem:
> >>+	scrub_free_dev(sdev);
> >
> >When taking the 'goto nomem' path, either all bios are leaked, or the
> >check in scrub_free_dev is buggy ...
> >
> >>+	return ERR_PTR(-ENOMEM);
> >>+}
> >>+
> >>+/*
> >>+ * scrub_recheck_error gets called when either verification of the page
> >>+ * failed or the bio failed to read, e.g. with EIO. In the latter case,
> >>+ * recheck_error gets called for every page in the bio, even though only
> >>+ * one may be bad
> >>+ */
> >>+static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
> >>+{
> >>+	struct scrub_dev *sdev = sbio->sdev;
> >>+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> >>+	struct bio *bio = NULL;
> >>+	struct page *page = NULL;
> >>+	struct scrub_fixup *fixup = NULL;
> >>+	int ret;
> >>+
> >>+	/*
> >>+	 * while we're in here we do not want the transaction to commit.
> >>+	 * To prevent it, we increment scrubs_running. scrub_pause will
> >>+	 * have to wait until we're finished
> >>+	 */
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	atomic_inc(&fs_info->scrubs_running);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+
> >>+	fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
> >>+	if (!fixup)
> >>+		goto malloc_error;
> >>+
> >>+	fixup->logical = sbio->logical + ix * PAGE_SIZE;
> >>+	fixup->physical = sbio->physical + ix * PAGE_SIZE;
> >>+	fixup->spag = sbio->spag[ix];
> >>+	fixup->sdev = sdev;
> >>+
> >>+	bio = bio_alloc(GFP_NOFS, 1);
> >>+	if (!bio)
> >>+		goto malloc_error;
> >>+	bio->bi_private = fixup;
> >>+	bio->bi_size = 0;
> >>+	bio->bi_bdev = sdev->dev->bdev;	/* FIXME: temporary for add_page */
> >>+	fixup->bio = bio;
> >>+	fixup->recheck = 0;
> >>+
> >>+	page = alloc_page(GFP_NOFS);
> >>+	if (!page)
> >>+		goto malloc_error;
> >>+
> >>+	ret = bio_add_page(bio, page, PAGE_SIZE, 0);
> >>+	if (!ret)
> >>+		goto malloc_error;
> >>+
> >>+	if (!sbio->err) {
> >>+		/*
> >>+		 * shorter path: just a checksum error, go ahead and correct it
> >>+		 */
> >>+		scrub_fixup_worker(&fixup->work);
> >>+		return;
> >>+	}
> >>+
> >>+	/*
> >>+	 * an I/O-error occured for one of the blocks in the bio, not
> >>+	 * necessarily for this one, so first try to read it separately
> >>+	 */
> >>+	SCRUB_INIT_WORK(&fixup->work, scrub_fixup_worker);
> >>+	fixup->recheck = 1;
> >>+	bio->bi_end_io = scrub_recheck_end_io;
> >>+	bio->bi_sector = fixup->physical >> 9;
> >>+	bio->bi_bdev = sdev->dev->bdev;
> >>+	submit_bio(0, bio);
> >>+
> >>+	return;
> >>+
> >>+malloc_error:
> >>+	if (bio)
> >>+		bio_put(bio);
> >>+	if (page)
> >>+		__free_page(page);
> >>+	if (fixup)
> >>+		kfree(fixup);
> >>+	spin_lock(&sdev->stat_lock);
> >>+	++sdev->stat.malloc_errors;
> >>+	spin_unlock(&sdev->stat_lock);
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	atomic_dec(&fs_info->scrubs_running);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+	wake_up(&fs_info->scrub_pause_wait);
> >>+}
> >>+
> >>+static void scrub_recheck_end_io(struct bio *bio, int err)
> >>+{
> >>+	struct scrub_fixup *fixup = bio->bi_private;
> >>+	struct btrfs_fs_info *fs_info = fixup->sdev->dev->dev_root->fs_info;
> >>+
> >>+	fixup->err = err;
> >>+	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &fixup->work);
> >>+}
> >>+
> >>+static int scrub_fixup_check(struct scrub_fixup *fixup)
> >>+{
> >>+	int ret = 1;
> >>+	struct page *page;
> >>+	void *buffer;
> >>+	u64 flags = fixup->spag.flags;
> >>+
> >>+	page = fixup->bio->bi_io_vec[0].bv_page;
> >>+	buffer = kmap_atomic(page, KM_USER0);
> >>+	if (flags & BTRFS_EXTENT_FLAG_DATA) {
> >>+		ret = scrub_checksum_data(fixup->sdev,
> >>+					  &fixup->spag, buffer);
> >>+	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> >>+		ret = scrub_checksum_tree_block(fixup->sdev,
> >>+						&fixup->spag,
> >>+						fixup->logical,
> >>+						buffer);
> >>+	} else {
> >>+		WARN_ON(1);
> >>+	}
> >>+	kunmap_atomic(buffer, KM_USER0);
> >>+
> >>+	return ret;
> >>+}
> >>+
> >>+static void scrub_fixup_worker(scrub_work_t *work)
> >>+{
> >>+	struct scrub_fixup *fixup;
> >>+	struct btrfs_fs_info *fs_info;
> >>+	u64 flags;
> >>+	int ret = 1;
> >>+
> >>+	fixup = container_of(work, struct scrub_fixup, work);
> >>+	fs_info = fixup->sdev->dev->dev_root->fs_info;
> >>+	flags = fixup->spag.flags;
> >>+
> >>+	if (fixup->recheck && fixup->err == 0)
> >>+		ret = scrub_fixup_check(fixup);
> >>+
> >>+	if (ret || fixup->err)
> >>+		scrub_fixup(fixup);
> >>+
> >>+	__free_page(fixup->bio->bi_io_vec[0].bv_page);
> >>+	bio_put(fixup->bio);
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	atomic_dec(&fs_info->scrubs_running);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+	wake_up(&fs_info->scrub_pause_wait);
> >>+
> >>+	kfree(fixup);
> >>+}
> >>+
> >>+static void scrub_fixup_end_io(struct bio *bio, int err)
> >>+{
> >>+	complete((struct completion *)bio->bi_private);
> >>+}
> >>+
> >>+static void scrub_fixup(struct scrub_fixup *fixup)
> >>+{
> >>+	struct scrub_dev *sdev = fixup->sdev;
> >>+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> >>+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
> >>+	struct btrfs_multi_bio *multi = NULL;
> >>+	struct bio *bio = fixup->bio;
> >>+	u64 length;
> >>+	int i;
> >>+	int ret;
> >>+	DECLARE_COMPLETION_ONSTACK(complete);
> >>+
> >>+	if ((fixup->spag.flags & BTRFS_EXTENT_FLAG_DATA) &&
> >>+	    (fixup->spag.have_csum == 0)) {
> >>+		/*
> >>+		 * nodatasum, don't try to fix anything
> >>+		 * FIXME: we can do better, open the inode and trigger a
> >>+		 * writeback
> >>+		 */
> >>+		goto uncorrectable;
> >>+	}
> >>+
> >>+	length = PAGE_SIZE;
> >>+	ret = btrfs_map_block(map_tree, REQ_WRITE, fixup->logical, &length,
> >>+	                      &multi, 0);
> >>+	if (ret || !multi || length < PAGE_SIZE) {
> >>+		printk(KERN_ERR
> >>+		       "scrub_fixup: btrfs_map_block failed us for %lld\n",
> >>+		       fixup->logical);
> >>+		WARN_ON(1);
> >>+		return;
> >>+	}
> >>+
> >>+	if (multi->num_stripes == 1) {
> >>+		/* there aren't any replicas */
> >>+		goto uncorrectable;
> >>+	}
> >>+
> >>+	/*
> >>+	 * first find a good copy
> >>+	 */
> >>+	for (i = 0; i < multi->num_stripes; ++i) {
> >>+		if (i == fixup->spag.mirror_num)
> >>+			continue;
> >>+
> >>+		bio->bi_sector = multi->stripes[i].physical >> 9;
> >>+		bio->bi_bdev = multi->stripes[i].dev->bdev;
> >>+		bio->bi_size = PAGE_SIZE;
> >>+		bio->bi_next = NULL;
> >>+		bio->bi_flags = 1 << BIO_UPTODATE;
> >>+		bio->bi_comp_cpu = -1;
> >>+		bio->bi_end_io = scrub_fixup_end_io;
> >>+		bio->bi_private = &complete;
> >>+
> >>+		submit_bio(0, bio);
> >>+
> >>+		wait_for_completion(&complete);
> >>+
> >>+		if (~bio->bi_flags & BIO_UPTODATE)
> >>+			/* I/O-error, this is not a good copy */
> >>+			continue;
> >>+
> >>+		ret = scrub_fixup_check(fixup);
> >>+		if (ret == 0)
> >>+			break;
> >>+	}
> >>+	if (i == multi->num_stripes)
> >>+		goto uncorrectable;
> >>+
> >>+	/*
> >>+	 * the bio now contains good data, write it back
> >>+	 */
> >>+	bio->bi_sector = fixup->physical >> 9;
> >>+	bio->bi_bdev = sdev->dev->bdev;
> >>+	bio->bi_size = PAGE_SIZE;
> >>+	bio->bi_next = NULL;
> >>+	bio->bi_flags = 1 << BIO_UPTODATE;
> >>+	bio->bi_comp_cpu = -1;
> >>+	bio->bi_end_io = scrub_fixup_end_io;
> >>+	bio->bi_private = &complete;
> >>+
> >>+	submit_bio(REQ_WRITE, bio);
> >>+
> >>+	wait_for_completion(&complete);
> >>+
> >>+	if (~bio->bi_flags & BIO_UPTODATE)
> >>+		/* I/O-error, writeback failed, give up */
> >>+		goto uncorrectable;
> >>+
> >>+	kfree(multi);
> >>+	spin_lock(&sdev->stat_lock);
> >>+	++sdev->stat.corrected_errors;
> >>+	spin_unlock(&sdev->stat_lock);
> >>+
> >>+	if (printk_ratelimit())
> >>+		printk(KERN_ERR "btrfs: fixed up at %lld\n", fixup->logical);
> >>+	return;
> >>+
> >>+uncorrectable:
> >>+	kfree(multi);
> >>+	spin_lock(&sdev->stat_lock);
> >>+	++sdev->stat.uncorrectable_errors;
> >>+	spin_unlock(&sdev->stat_lock);
> >>+
> >>+	if (printk_ratelimit())
> >>+		printk(KERN_ERR "btrfs: unable to fixup at %lld\n",
> >>+			 fixup->logical);
> >>+}
> >>+
> >>+static void scrub_bio_end_io(struct bio *bio, int err)
> >>+{
> >>+	struct scrub_bio *sbio = bio->bi_private;
> >>+	struct scrub_dev *sdev = sbio->sdev;
> >>+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> >>+
> >>+	sbio->err = err;
> >>+
> >>+	SCRUB_QUEUE_WORK(fs_info->scrub_workers, &sbio->work);
> >>+}
> >>+
> >>+static void scrub_checksum(scrub_work_t *work)
> >>+{
> >>+	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
> >>+	struct scrub_dev *sdev = sbio->sdev;
> >>+	struct page *page;
> >>+	void *buffer;
> >>+	int i;
> >>+	u64 flags;
> >>+	u64 logical;
> >>+	int ret;
> >>+
> >>+	if (sbio->err) {
> >>+		for (i = 0; i < sbio->count; ++i) {
> >>+			scrub_recheck_error(sbio, i);
> >>+		}
> >>+		spin_lock(&sdev->stat_lock);
> >>+		++sdev->stat.read_errors;
> >>+		spin_unlock(&sdev->stat_lock);
> >>+		goto out;
> >>+	}
> >>+	for (i = 0; i < sbio->count; ++i) {
> >>+		page = sbio->bio->bi_io_vec[i].bv_page;
> >>+		buffer = kmap_atomic(page, KM_USER0);
> >>+		flags = sbio->spag[i].flags;
> >>+		logical = sbio->logical + i * PAGE_SIZE;
> >>+		ret = 0;
> >>+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
> >>+			ret = scrub_checksum_data(sdev, sbio->spag + i, buffer);
> >>+		} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
> >>+			ret = scrub_checksum_tree_block(sdev, sbio->spag + i,
> >>+			                                logical, buffer);
> >>+		} else if (flags & BTRFS_EXTENT_FLAG_SUPER) {
> >>+			BUG_ON(i);
> >>+			(void)scrub_checksum_super(sbio, buffer);
> >>+		} else {
> >>+			WARN_ON(1);
> >>+		}
> >>+		kunmap_atomic(buffer, KM_USER0);
> >>+		if (ret)
> >>+			scrub_recheck_error(sbio, i);
> >>+	}
> >>+
> >>+out:
> >>+	spin_lock(&sdev->list_lock);
> >>+	sbio->next_free = sdev->first_free;
> >>+	sdev->first_free = sbio->index;
> >>+	spin_unlock(&sdev->list_lock);
> >>+	atomic_dec(&sdev->in_flight);
> >>+	wake_up(&sdev->list_wait);
> >>+}
> >>+
> >>+static int scrub_checksum_data(struct scrub_dev *sdev,
> >>+                               struct scrub_page *spag, void *buffer)
> >>+{
> >>+	u8 csum[BTRFS_CSUM_SIZE];
> >>+	u32 crc = ~(u32)0;
> >>+	int fail = 0;
> >>+	struct btrfs_root *root = sdev->dev->dev_root;
> >>+
> >>+	if (!spag->have_csum)
> >>+		return 0;
> >>+
> >>+	crc = btrfs_csum_data(root, buffer, crc, PAGE_SIZE);
> >>+	btrfs_csum_final(crc, csum);
> >>+	if (memcmp(csum, spag->csum, sdev->csum_size))
> >>+		fail = 1;
> >>+
> >>+	spin_lock(&sdev->stat_lock);
> >>+	++sdev->stat.data_extents_scrubbed;
> >>+	sdev->stat.data_bytes_scrubbed += PAGE_SIZE;
> >>+	if (fail)
> >>+		++sdev->stat.csum_errors;
> >>+	spin_unlock(&sdev->stat_lock);
> >>+
> >>+	return fail;
> >>+}
> >>+
> >>+static int scrub_checksum_tree_block(struct scrub_dev *sdev,
> >>+                                     struct scrub_page *spag, u64 logical,
> >>+                                     void *buffer)
> >>+{
> >>+	struct btrfs_header *h;
> >>+	struct btrfs_root *root = sdev->dev->dev_root;
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	u8 csum[BTRFS_CSUM_SIZE];
> >>+	u32 crc = ~(u32)0;
> >>+	int fail = 0;
> >>+	int crc_fail = 0;
> >>+
> >>+	/*
> >>+	 * we don't use the getter functions here, as we
> >>+	 * a) don't have an extent buffer and
> >>+	 * b) the page is already kmapped
> >>+	 */
> >>+	h = (struct btrfs_header *)buffer;
> >>+
> >>+	if (logical != le64_to_cpu(h->bytenr))
> >>+		++fail;
> >>+
> >>+	if (spag->generation != le64_to_cpu(h->generation))
> >>+		++fail;
> >>+
> >>+	if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
> >>+		++fail;
> >>+
> >>+	if (memcmp(h->chunk_tree_uuid, fs_info->chunk_tree_uuid,
> >>+	           BTRFS_UUID_SIZE))
> >>+		++fail;
> >>+
> >>+	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
> >>+	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
> >>+	btrfs_csum_final(crc, csum);
> >>+	if (memcmp(csum, h->csum, sdev->csum_size))
> >>+		++crc_fail;
> >>+
> >>+	spin_lock(&sdev->stat_lock);
> >>+	++sdev->stat.tree_extents_scrubbed;
> >>+	sdev->stat.tree_bytes_scrubbed += PAGE_SIZE;
> >>+	if (crc_fail)
> >>+		++sdev->stat.csum_errors;
> >>+	if (fail)
> >>+		++sdev->stat.verify_errors;
> >>+	spin_unlock(&sdev->stat_lock);
> >>+
> >>+	return (fail || crc_fail);
> >>+}
> >>+
> >>+static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer)
> >>+{
> >>+	struct btrfs_super_block *s;
> >>+	u64 logical;
> >>+	struct scrub_dev *sdev = sbio->sdev;
> >>+	struct btrfs_root *root = sdev->dev->dev_root;
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	u8 csum[BTRFS_CSUM_SIZE];
> >>+	u32 crc = ~(u32)0;
> >>+	int fail = 0;
> >>+
> >>+	s = (struct btrfs_super_block *)buffer;
> >>+	logical = sbio->logical;
> >>+
> >>+	if (logical != le64_to_cpu(s->bytenr))
> >>+		++fail;
> >>+
> >>+	if (sbio->spag[0].generation != le64_to_cpu(s->generation))
> >>+		++fail;
> >>+
> >>+	if (memcmp(s->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
> >>+		++fail;
> >>+
> >>+	crc = btrfs_csum_data(root, buffer + BTRFS_CSUM_SIZE, crc,
> >>+	                      PAGE_SIZE - BTRFS_CSUM_SIZE);
> >>+	btrfs_csum_final(crc, csum);
> >>+	if (memcmp(csum, s->csum, sbio->sdev->csum_size))
> >>+		++fail;
> >>+
> >>+	if (fail) {
> >>+		/*
> >>+		 * if we find an error in a super block, we just report it.
> >>+		 * They will get written with the next transaction commit
> >>+		 * anyway
> >>+		 */
> >>+		spin_lock(&sdev->stat_lock);
> >>+		++sdev->stat.super_errors;
> >>+		spin_unlock(&sdev->stat_lock);
> >>+	}
> >>+
> >>+	return fail;
> >>+}
> >>+
> >>+static int scrub_submit(struct scrub_dev *sdev)
> >>+{
> >>+	struct scrub_bio *sbio;
> >>+
> >>+	if (sdev->curr == -1)
> >>+		return 0;
> >>+
> >>+	sbio = sdev->bios + sdev->curr;
> >>+	
> >>+	sbio->bio->bi_sector = sbio->physical >> 9;
> >>+	sbio->bio->bi_size = sbio->count * PAGE_SIZE;
> >>+	sbio->bio->bi_next = NULL;
> >>+	sbio->bio->bi_flags = 1 << BIO_UPTODATE;
> >>+	sbio->bio->bi_comp_cpu = -1;
> >>+	sbio->bio->bi_bdev = sdev->dev->bdev;
> >>+	sdev->curr = -1;
> >>+	atomic_inc(&sdev->in_flight);
> >>+
> >>+	submit_bio(0, sbio->bio);
> >>+
> >>+	return 0;
> >>+}
> >>+
> >>+static int scrub_page(struct scrub_dev *sdev, u64 logical, u64 len,
> >>+                      u64 physical, u64 flags, u64 gen, u64 mirror_num,
> >>+                      u8 *csum, int force)
> >>+{
> >>+	struct scrub_bio *sbio;
> >>+again:
> >>+	/*
> >>+	 * grab a fresh bio or wait for one to become available
> >>+	 */
> >>+	while (sdev->curr == -1) {
> >>+		unsigned long flags;
> >>+		spin_lock_irqsave(&sdev->list_lock, flags);
> >
> >Is this called from an interrupt or why is the _irqsave variant used?
> 
> You're right, it is not needed anymore. It used to get locked directly
> from the end_io callback, but now everything is deferred to workers.
> 
> >
> >>+		sdev->curr = sdev->first_free;
> >>+		if (sdev->curr != -1) {
> >>+			sdev->first_free = sdev->bios[sdev->curr].next_free;
> >>+			sdev->bios[sdev->curr].next_free = -1;
> >>+			sdev->bios[sdev->curr].count = 0;
> >>+			spin_unlock_irqrestore(&sdev->list_lock, flags);
> >>+		} else {
> >>+			spin_unlock_irqrestore(&sdev->list_lock, flags);
> >>+			wait_event(sdev->list_wait, sdev->first_free != -1);
> >>+		}
> >>+	}
> >>+	sbio = sdev->bios + sdev->curr;
> >>+	if (sbio->count == 0) {
> >>+		sbio->physical = physical;
> >>+		sbio->logical = logical;
> >>+	} else if (sbio->physical + sbio->count * PAGE_SIZE != physical) {
> >>+		scrub_submit(sdev);
> >>+		goto again;
> >>+	}
> >>+	sbio->spag[sbio->count].flags = flags;
> >>+	sbio->spag[sbio->count].generation = gen;
> >>+	sbio->spag[sbio->count].have_csum = 0;
> >>+	sbio->spag[sbio->count].mirror_num = mirror_num;
> >>+	if (csum) {
> >>+		sbio->spag[sbio->count].have_csum = 1;
> >>+		memcpy(sbio->spag[sbio->count].csum, csum, sdev->csum_size);
> >>+	}
> >>+	++sbio->count;
> >>+	if (sbio->count == SCRUB_PAGES_PER_BIO || force)
> >>+		scrub_submit(sdev);
> >>+		
> >>+	return 0;
> >>+}
> >>+
> >>+static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
> >>+                           u8 *csum)
> >>+{
> >>+	struct btrfs_ordered_sum *sum = NULL;
> >>+	int ret = 0;
> >>+	unsigned long i;
> >>+	unsigned long num_sectors;
> >>+	u32 sectorsize = sdev->dev->dev_root->sectorsize;
> >>+
> >>+	while (!list_empty(&sdev->csum_list)) {
> >>+		sum = list_first_entry(&sdev->csum_list,
> >>+				       struct btrfs_ordered_sum, list);
> >>+		if (sum->bytenr > logical)
> >>+			return 0;
> >>+		if (sum->bytenr + sum->len > logical)
> >>+			break;
> >>+
> >>+		++sdev->stat.csum_discards;
> >>+		list_del(&sum->list);
> >>+		kfree(sum);
> >>+		sum = NULL;
> >>+	}
> >>+	if (!sum)
> >>+		return 0;
> >>+
> >>+	num_sectors = sum->len / sectorsize;
> >>+	for (i = 0; i < num_sectors; ++i) {
> >>+		if (sum->sums[i].bytenr == logical) {
> >>+			memcpy(csum, &sum->sums[i].sum, sdev->csum_size);
> >>+			ret = 1;
> >>+			break;
> >>+		}
> >>+	}
> >>+	if (ret && i == num_sectors - 1) {
> >>+		list_del(&sum->list);
> >>+		kfree(sum);
> >>+	}
> >>+	return ret;
> >>+}
> >>+
> >>+/* scrub extent tries to collect up to 64 kB for each bio */
> >>+static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
> >>+                        u64 physical, u64 flags, u64 gen, u64 mirror_num)
> >>+{
> >>+	int ret;
> >>+	u8 csum[BTRFS_CSUM_SIZE];
> >>+
> >>+	while(len) {
> >>+		u64 l = min_t(u64, len, PAGE_SIZE);
> >>+		int have_csum = 0;
> >>+
> >>+		if (flags & BTRFS_EXTENT_FLAG_DATA) {
> >>+			/* push csums to sbio */
> >>+			have_csum = scrub_find_csum(sdev, logical, l, csum);
> >>+			if (have_csum == 0)
> >>+				++sdev->stat.no_csum;
> >>+		}
> >>+		ret = scrub_page(sdev, logical, l, physical, flags, gen,
> >>+		                 mirror_num, have_csum ? csum : NULL, 0);
> >>+		if (ret)
> >>+			return ret;
> >>+		len -= l;
> >>+		logical += l;
> >>+		physical += l;
> >>+	}
> >>+	return 0;
> >>+}
> >>+
> >>+static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
> >>+	struct map_lookup *map, int num, u64 base, u64 length)
> >>+{
> >>+	struct btrfs_path *path;
> >>+	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
> >>+	struct btrfs_root *root = fs_info->extent_root;
> >>+	struct btrfs_root *csum_root = fs_info->csum_root;
> >>+	struct btrfs_extent_item *extent;
> >>+	u64 flags;
> >>+	int ret;
> >>+	int slot;
> >>+	int i;
> >>+	int nstripes;
> >>+	int start_stripe;
> >>+	struct extent_buffer *l;
> >>+	struct btrfs_key key;
> >>+	u64 physical;
> >>+	u64 logical;
> >>+	u64 generation;
> >>+	u64 mirror_num;
> >>+
> >>+	u64 increment = map->stripe_len;
> >>+	u64 offset;
> >>+
> >>+	nstripes = length;
> >>+	offset = 0;
> >>+	do_div(nstripes, map->stripe_len);
> >>+	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
> >>+		offset = map->stripe_len * num;
> >>+		increment = map->stripe_len * map->num_stripes;
> >>+		mirror_num = 0;
> >>+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
> >>+		int factor = map->num_stripes / map->sub_stripes;
> >>+		offset = map->stripe_len * (num / map->sub_stripes);
> >>+		increment = map->stripe_len * factor;
> >>+		mirror_num = num % map->sub_stripes;
> >>+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
> >>+		increment = map->stripe_len;
> >>+		mirror_num = num % map->num_stripes;
> >>+	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
> >>+		increment = map->stripe_len;
> >>+		mirror_num = num % map->num_stripes;
> >>+	} else {
> >>+		increment = map->stripe_len;
> >>+		mirror_num = 0;
> >>+	}
> >>+
> >>+	path = btrfs_alloc_path();
> >>+	if (!path)
> >>+		return -ENOMEM;
> >>+
> >>+	path->reada = 2;
> >>+	path->search_commit_root = 1;
> >>+	path->skip_locking = 1;
> >>+
> >>+	/*
> >>+	 * find all extents for each stripe and just read them to get
> >>+	 * them into the page cache
> >>+	 * FIXME: we can do better. build a more intelligent prefetching
> >>+	 */
> >>+	logical = base + offset;
> >>+	physical = map->stripes[num].physical;
> >>+	ret = 0;
> >>+	for (i = 0; i < nstripes; ++i) {
> >>+		key.objectid = logical;
> >>+		key.type = BTRFS_EXTENT_ITEM_KEY;
> >>+		key.offset = (u64)0;
> >>+
> >>+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >>+		if (ret < 0)
> >>+			goto out;
> >>+
> >>+		l = path->nodes[0];
> >>+		slot = path->slots[0];
> >>+		btrfs_item_key_to_cpu(l, &key, slot);
> >>+		if (key.objectid != logical) {
> >>+			ret = btrfs_previous_item(root, path, 0,
> >>+			                          BTRFS_EXTENT_ITEM_KEY);
> >>+			if (ret < 0)
> >>+				goto out;
> >>+		}
> >>+
> >>+		while (1) {
> >>+			l = path->nodes[0];
> >>+			slot = path->slots[0];
> >>+			if (slot >= btrfs_header_nritems(l)) {
> >>+				ret = btrfs_next_leaf(root, path);
> >>+				if (ret == 0)
> >>+					continue;
> >>+				if (ret < 0)
> >>+					goto out;
> >>+
> >>+				break;
> >>+			}
> >>+			btrfs_item_key_to_cpu(l, &key, slot);
> >>+
> >>+			if (key.objectid + key.offset <= logical)
> >>+				goto next1;
> >>+
> >>+			if (key.objectid >= logical + map->stripe_len)
> >>+				break;
> >>+next1:
> >>+			path->slots[0]++;
> >>+		}
> >>+		btrfs_release_path(root, path);
> >>+		logical += increment;
> >>+		physical += map->stripe_len;
> >>+		cond_resched();
> >>+	}
> >>+
> >>+	/*
> >>+	 * collect all data csums for the stripe to avoid seeking during
> >>+	 * the scrub. This might currently (crc32) end up to be about 1MB
> >>+	 */
> >>+	start_stripe = 0;
> >>+again:
> >>+	logical = base + offset + start_stripe * map->stripe_len;
> >>+	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
> >>+	for (i = start_stripe; i < nstripes; ++i) {
> >>+		ret = btrfs_lookup_csums_range(csum_root, logical,
> >>+		                               logical + map->stripe_len - 1,
> >>+		                               &sdev->csum_list, 1);
> >>+		if (ret)
> >>+			goto out;
> >>+
> >>+		logical += increment;
> >>+		cond_resched();
> >>+	}
> >>+	/*
> >>+	 * now find all extents for each stripe and scrub them
> >>+	 */
> >>+	logical = base + offset + start_stripe * map->stripe_len;
> >>+	physical = map->stripes[num].physical + start_stripe * map->stripe_len;
> >>+	ret = 0;
> >>+	for (i = start_stripe; i < nstripes; ++i) {
> >>+		/*
> >>+		 * canceled?
> >>+		 */
> >>+		if (atomic_read(&fs_info->scrub_cancel_req) ||
> >>+		    atomic_read(&sdev->cancel_req)) {
> >>+			ret = -ECANCELED;
> >>+			goto out;
> >>+		}
> >>+		/*
> >>+		 * check to see if we have to pause
> >>+		 */
> >>+		if (atomic_read(&fs_info->scrub_pause_req)) {
> >>+			/* push queued extents */
> >>+			scrub_submit(sdev);
> >>+			wait_event(sdev->list_wait,
> >>+			           atomic_read(&sdev->in_flight) == 0);
> >>+			atomic_inc(&fs_info->scrubs_paused);
> >>+			wake_up(&fs_info->scrub_pause_wait);
> >>+			mutex_lock(&fs_info->scrub_lock);
> >>+			while(atomic_read(&fs_info->scrub_pause_req)) {
> >>+				mutex_unlock(&fs_info->scrub_lock);
> >>+				wait_event(fs_info->scrub_pause_wait,
> >>+				   atomic_read(&fs_info->scrub_pause_req) == 0);
> >>+				mutex_lock(&fs_info->scrub_lock);
> >>+			}
> >>+			atomic_dec(&fs_info->scrubs_paused);
> >>+			mutex_unlock(&fs_info->scrub_lock);
> >>+			wake_up(&fs_info->scrub_pause_wait);
> >>+			scrub_free_csums(sdev);
> >>+			goto again;
> >>+		}
> >>+
> >>+		key.objectid = logical;
> >>+		key.type = BTRFS_EXTENT_ITEM_KEY;
> >>+		key.offset = (u64)0;
> >>+
> >>+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >>+		if (ret < 0)
> >>+			goto out;
> >>+
> >>+		l = path->nodes[0];
> >>+		slot = path->slots[0];
> >>+		btrfs_item_key_to_cpu(l, &key, slot);
> >>+		if (key.objectid != logical) {
> >>+			ret = btrfs_previous_item(root, path, 0,
> >>+			                          BTRFS_EXTENT_ITEM_KEY);
> >>+			if (ret < 0)
> >>+				goto out;
> >>+		}
> >>+
> >>+		while (1) {
> >>+			l = path->nodes[0];
> >>+			slot = path->slots[0];
> >>+			if (slot >= btrfs_header_nritems(l)) {
> >>+				ret = btrfs_next_leaf(root, path);
> >>+				if (ret == 0)
> >>+					continue;
> >>+				if (ret < 0)
> >>+					goto out;
> >>+
> >>+				break;
> >>+			}
> >>+			btrfs_item_key_to_cpu(l, &key, slot);
> >>+
> >>+			if (key.objectid + key.offset <= logical)
> >>+				goto next;
> >>+
> >>+			if (key.objectid >= logical + map->stripe_len)
> >>+				break;
> >>+
> >>+			if (btrfs_key_type(&key) != BTRFS_EXTENT_ITEM_KEY)
> >>+				goto next;
> >>+
> >>+			extent = btrfs_item_ptr(l, slot,
> >>+			                        struct btrfs_extent_item);
> >>+			flags = btrfs_extent_flags(l, extent);
> >>+			generation = btrfs_extent_generation(l, extent);
> >>+
> >>+			if (key.objectid < logical &&
> >>+			    (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) {
> >>+				printk(KERN_ERR
> >>+				       "btrfs scrub: tree block %lld spanning "
> >>+				       "stripes, ignored. logical=%lld\n",
> >>+				       key.objectid, logical);
> >>+				goto next;
> >>+			}
> >>+
> >>+			/*
> >>+			 * trim extent to this stripe
> >>+			 */
> >>+			if (key.objectid < logical) {
> >>+				key.offset -= logical - key.objectid;
> >>+				key.objectid = logical;
> >>+			}
> >>+			if (key.objectid + key.offset >
> >>+			    logical + map->stripe_len) {
> >>+				key.offset = logical + map->stripe_len -
> >>+				             key.objectid;
> >>+			}
> >>+
> >>+			ret = scrub_extent(sdev, key.objectid, key.offset,
> >>+			                   key.objectid - logical + physical,
> >>+			                   flags, generation, mirror_num);
> >>+			if (ret)
> >>+				goto out;
> >>+next:
> >>+			path->slots[0]++;
> >>+		}
> >>+		btrfs_release_path(root, path);
> >>+		logical += increment;
> >>+		physical += map->stripe_len;
> >>+		spin_lock(&sdev->stat_lock);
> >>+		sdev->stat.last_physical = physical;
> >>+		spin_unlock(&sdev->stat_lock);
> >>+	}
> >>+	/* push queued extents */
> >>+	scrub_submit(sdev);
> >>+
> >>+out:
> >>+	btrfs_free_path(path);
> >>+	return ret < 0 ? ret : 0;
> >>+}
> >>+
> >>+static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev,
> >>+	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length)
> >>+{
> >>+	struct btrfs_mapping_tree *map_tree =
> >>+		&sdev->dev->dev_root->fs_info->mapping_tree;
> >>+	struct map_lookup *map;
> >>+	struct extent_map *em;
> >>+	int i;
> >>+	int ret;
> >>+
> >>+	read_lock(&map_tree->map_tree.lock);
> >>+	em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
> >>+	read_unlock(&map_tree->map_tree.lock);
> >>+
> >>+	if (!em)
> >>+		return -EINVAL;
> >>+
> >>+	map = (struct map_lookup *)em->bdev;
> >>+	if (em->start != chunk_offset)
> >>+		return -EINVAL;
> >>+
> >>+	if (em->len < length)
> >>+		return -EINVAL;
> >>+
> >>+	for (i = 0; i < map->num_stripes; ++i) {
> >>+		if (map->stripes[i].dev == sdev->dev) {
> >>+			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
> >>+			if (ret)
> >>+				return ret;
> >>+		}
> >>+	}
> >>+	return 0;
> >>+}
> >>+
> >>+static noinline_for_stack
> >>+int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
> >>+{
> >>+	struct btrfs_dev_extent *dev_extent = NULL;
> >>+	struct btrfs_path *path;
> >>+	struct btrfs_root *root = sdev->dev->dev_root;
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	u64 length;
> >>+	u64 chunk_tree;
> >>+	u64 chunk_objectid;
> >>+	u64 chunk_offset;
> >>+	int ret;
> >>+	int slot;
> >>+	struct extent_buffer *l;
> >>+	struct btrfs_key key;
> >>+	struct btrfs_key found_key;
> >>+	struct btrfs_block_group_cache *cache;
> >>+
> >>+	path = btrfs_alloc_path();
> >>+	if (!path)
> >>+		return -ENOMEM;
> >>+
> >>+	path->reada = 2;
> >>+	path->search_commit_root = 1;
> >>+	path->skip_locking = 1;
> >>+
> >>+	key.objectid = sdev->dev->devid;
> >>+	key.offset = 0ull;
> >>+	key.type = BTRFS_DEV_EXTENT_KEY;
> >>+
> >>+
> >>+	while (1) {
> >>+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >>+		if (ret < 0)
> >>+			goto out;
> >>+		ret = 0;
> >>+
> >>+		l = path->nodes[0];
> >>+		slot = path->slots[0];
> >>+
> >>+		btrfs_item_key_to_cpu(l, &found_key, slot);
> >>+
> >>+		if (found_key.objectid != sdev->dev->devid)
> >>+			break;
> >>+
> >>+		if (btrfs_key_type(&key) != BTRFS_DEV_EXTENT_KEY)
> >>+			break;
> >>+
> >>+		if (found_key.offset >= end)
> >>+			break;
> >>+
> >>+		if (found_key.offset < key.offset)
> >>+			break;
> >>+
> >>+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
> >>+		length = btrfs_dev_extent_length(l, dev_extent);
> >>+
> >>+		if (found_key.offset + length <= start) {
> >>+			key.offset = found_key.offset + length;
> >>+			btrfs_release_path(root, path);
> >>+			continue;
> >>+		}
> >>+
> >>+		chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
> >>+		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
> >>+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
> >>+
> >>+		/*
> >>+		 * get a reference on the corresponding block group to prevent
> >>+		 * the chunk from going away while we scrub it
> >>+		 */
> >>+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
> >>+		if (!cache) {
> >>+			ret = -ENOENT;
> >>+			goto out;
> >>+		}
> >>+		ret = scrub_chunk(sdev, chunk_tree, chunk_objectid,
> >>+		                  chunk_offset, length);
> >>+		btrfs_put_block_group(cache);
> >>+		if (ret)
> >>+			break;
> >>+
> >>+		key.offset = found_key.offset + length;
> >>+		btrfs_release_path(root, path);
> >>+	}
> >>+
> >>+out:
> >>+	btrfs_free_path(path);
> >>+	return ret;
> >>+}
> >>+
> >>+static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
> >>+{
> >>+	int	i;
> >>+	u64	bytenr;
> >>+	u64	gen;
> >>+	int	ret;
> >>+	struct btrfs_device *device = sdev->dev;
> >>+	struct btrfs_root *root = device->dev_root;
> >>+
> >>+	gen = root->fs_info->last_trans_committed;
> >>+
> >>+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
> >>+		bytenr = btrfs_sb_offset(i);
> >>+		if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->total_bytes)
> >>+			break;
> >>+
> >>+		ret = scrub_page(sdev, bytenr, PAGE_SIZE, bytenr,
> >>+				 BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
> >>+		if (ret)
> >>+			return ret;
> >>+	}
> >>+	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
> >>+
> >>+	return 0;
> >>+}
> >>+
> >>+/*
> >>+ * get a reference count on fs_info->scrub_workers. start worker if necessary
> >>+ */
> >>+static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	if (fs_info->scrub_workers_refcnt == 0) {
> >>+#ifdef SCRUB_BTRFS_WORKER
> >>+		btrfs_start_workers(&fs_info->scrub_workers, 1);
> >>+#else
> >>+		fs_info->scrub_workers = create_workqueue("scrub");
> >>+		if (!fs_info->scrub_workers) {
> >>+			mutex_unlock(&fs_info->scrub_lock);
> >>+			return -ENOMEM;
> >>+		}
> >>+#endif
> >>+	}
> >>+	++fs_info->scrub_workers_refcnt;
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+
> >>+	return 0;
> >>+}
> >>+
> >>+static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
> >
> >This func is always called immediately after a mutex_unlock(scrub_lock),
> >and then takes the lock again. I suggest to drop locking here and adjust
> >all callsites.
> >
> 
> This only holds for 2 out of 4 calls. I don't know if it's
> worth it, as this is only a very low-frequency path.

Well, you're right, not an issue for the exit paths.

> 
> >Same applies for scrub_workers_get()

And this was not correct, _get has only one call site.


dave

> >
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	if (--fs_info->scrub_workers_refcnt == 0) {
> >>+#ifdef SCRUB_BTRFS_WORKER
> >>+		btrfs_stop_workers(&fs_info->scrub_workers);
> >>+#else
> >>+		destroy_workqueue(fs_info->scrub_workers);
> >>+		fs_info->scrub_workers = NULL;
> >>+#endif
> >>+
> >>+	}
> >>+	WARN_ON(fs_info->scrub_workers_refcnt < 0);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+}
> >>+
> >>+
> >>+int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
> >>+                    struct btrfs_scrub_progress *progress)
> >>+{
> >>+	struct scrub_dev *sdev;
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	int ret;
> >>+	struct btrfs_device *dev;
> >>+
> >>+	if (root->fs_info->closing)
> >>+		return -EINVAL;
> >>+
> >>+	/*
> >>+	 * check some assumptions
> >>+	 */
> >>+	if (root->sectorsize != PAGE_SIZE ||
> >>+	    root->sectorsize != root->leafsize ||
> >>+	    root->sectorsize != root->nodesize) {
> >>+		printk(KERN_ERR "btrfs_scrub: size assumptions fail\n");
> >>+		return -EINVAL;
> >>+	}
> >>+
> >>+	ret = scrub_workers_get(root);
> >>+	if (ret)
> >>+		return ret;
> >>+
> >>+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> >>+	dev = btrfs_find_device(root, devid, NULL, NULL);
> >>+	if (!dev || dev->missing) {
> >>+		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> >>+		scrub_workers_put(root);
> >>+		return -ENODEV;
> >>+	}
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	if (dev->scrub_device) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		scrub_workers_put(root);
> >>+		return -EINPROGRESS;
> >>+	}
> >>+	sdev = scrub_setup_dev(dev);
> >>+	if (IS_ERR(sdev)) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		scrub_workers_put(root);
> >>+		return PTR_ERR(sdev);
> >>+	}
> >>+	dev->scrub_device = sdev;
> >>+
> >>+	atomic_inc(&fs_info->scrubs_running);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> >>+
> >>+	down_read(&fs_info->scrub_super_lock);
> >>+	ret = scrub_supers(sdev);
> >>+	up_read(&fs_info->scrub_super_lock);
> >>+
> >>+	if (!ret)
> >>+		ret = scrub_enumerate_chunks(sdev, start, end);
> >>+
> >>+	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	atomic_dec(&fs_info->scrubs_running);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+	wake_up(&fs_info->scrub_pause_wait);
> >>+
> >>+	if (progress)
> >>+		memcpy(progress, &sdev->stat, sizeof(*progress));
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	dev->scrub_device = NULL;
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+
> >>+	scrub_free_dev(sdev);
> >>+	scrub_workers_put(root);
> >>+
> >>+	return ret;
> >>+}
> >>+
> >>+int btrfs_scrub_pause(struct btrfs_root *root)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	atomic_inc(&fs_info->scrub_pause_req);
> >>+	while (atomic_read(&fs_info->scrubs_paused) !=
> >>+	       atomic_read(&fs_info->scrubs_running)) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		wait_event(fs_info->scrub_pause_wait,
> >>+			   atomic_read(&fs_info->scrubs_paused) ==
> >>+			   atomic_read(&fs_info->scrubs_running));
> >>+		mutex_lock(&fs_info->scrub_lock);
> >>+	}
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+
> >>+	return 0;
> >>+}
> >>+
> >>+int btrfs_scrub_continue(struct btrfs_root *root)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+
> >>+	atomic_dec(&fs_info->scrub_pause_req);
> >>+	wake_up(&fs_info->scrub_pause_wait);
> >>+	return 0;
> >>+}
> >>+
> >>+int btrfs_scrub_pause_super(struct btrfs_root *root)
> >>+{
> >>+	down_write(&root->fs_info->scrub_super_lock);
> >>+	return 0;
> >>+}
> >>+
> >>+int btrfs_scrub_continue_super(struct btrfs_root *root)
> >>+{
> >>+	up_write(&root->fs_info->scrub_super_lock);
> >>+	return 0;
> >>+}
> >>+
> >>+int btrfs_scrub_cancel(struct btrfs_root *root)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	if (!atomic_read(&fs_info->scrubs_running)) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		return -ENOTCONN;
> >>+	}
> >>+
> >>+	atomic_inc(&fs_info->scrub_cancel_req);
> >>+	while(atomic_read(&fs_info->scrubs_running)) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		wait_event(fs_info->scrub_pause_wait,
> >>+			   atomic_read(&fs_info->scrubs_running) == 0);
> >>+		mutex_lock(&fs_info->scrub_lock);
> >>+	}
> >>+	atomic_dec(&fs_info->scrub_cancel_req);
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+	
> >>+	return 0;
> >>+}
> >>+
> >>+int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	struct scrub_dev *sdev;
> >>+
> >>+	mutex_lock(&fs_info->scrub_lock);
> >>+	sdev = dev->scrub_device;
> >>+	if (!sdev) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		return -ENOTCONN;
> >>+	}
> >>+	atomic_inc(&sdev->cancel_req);
> >>+	while(dev->scrub_device) {
> >>+		mutex_unlock(&fs_info->scrub_lock);
> >>+		wait_event(fs_info->scrub_pause_wait,
> >>+		           dev->scrub_device == NULL);
> >>+		mutex_lock(&fs_info->scrub_lock);
> >>+	}
> >>+	mutex_unlock(&fs_info->scrub_lock);
> >>+		
> >>+	return 0;
> >>+}
> >>+int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid)
> >>+{
> >>+	struct btrfs_fs_info *fs_info = root->fs_info;
> >>+	struct btrfs_device *dev;
> >>+	int ret;
> >>+
> >>+	/*
> >>+	 * we have to hold the device_list_mutex here so the device
> >>+	 * does not go away in cancel_dev. FIXME: find a better solution
> >>+	 */
> >>+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
> >>+	dev = btrfs_find_device(root, devid, NULL, NULL);
> >>+	if (!dev) {
> >>+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >>+		return -ENODEV;
> >>+	}
> >>+	ret = btrfs_scrub_cancel_dev(root, dev);
> >>+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >>+
> >>+	return ret;
> >>+}
> >>+	
> >>+int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
> >>+                         struct btrfs_scrub_progress *progress)
> >>+{
> >>+	struct btrfs_device *dev;
> >>+	struct scrub_dev *sdev = NULL;
> >>+
> >>+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> >>+	dev = btrfs_find_device(root, devid, NULL, NULL);
> >>+	if (dev)
> >>+		sdev = dev->scrub_device;
> >>+	if (sdev)
> >>+		memcpy(progress, &sdev->stat, sizeof(*progress));
> >>+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> >>+
> >>+	return dev ? (sdev ? 0 : -ENOTCONN) : -ENODEV;
> >>+}
> >>-- 
> >>1.7.3.4
> >>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes
  2011-03-22 16:38       ` David Sterba
@ 2011-03-23 14:19         ` Arne Jansen
  0 siblings, 0 replies; 15+ messages in thread
From: Arne Jansen @ 2011-03-23 14:19 UTC (permalink / raw)
  To: linux-btrfs, dave

On 22.03.2011 17:38, David Sterba wrote:

>> David Sterba wrote:
>>> On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
>>>> This is the main scrub code.
>>>>
>>>
>>> sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3 allocation in
>>> scrub_setup_dev()
>>
>> Is this a problem? There are only a few allocations of it, one per device.
> 
> High-order allocations may fail when memory is fragmented and should be
> avoided when possible. (And they can be avoided here: allocate each
> 'struct scrub_bio' separately and fill the bios array with pointers.)
> Otherwise the scrub ioctl may fail to start until an order-3 allocation
> becomes available.
> 

I updated this in my git repo.

Thanks,
Arne

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-03-23 14:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-11 14:49 [PATCH v2 0/6] btrfs: scrub Arne Jansen
2011-03-11 14:49 ` [PATCH v2 1/6] btrfs: add parameter to btrfs_lookup_csum_range Arne Jansen
2011-03-11 14:49 ` [PATCH v2 2/6] btrfs: make struct map_lookup public Arne Jansen
2011-03-11 14:49 ` [PATCH v2 3/6] btrfs: add scrub code and prototypes Arne Jansen
2011-03-11 16:34   ` David Sterba
2011-03-12 10:54     ` Arne Jansen
2011-03-22 16:38       ` David Sterba
2011-03-23 14:19         ` Arne Jansen
2011-03-11 14:49 ` [PATCH v2 4/6] btrfs: sync scrub with commit & device removal Arne Jansen
2011-03-11 14:49 ` [PATCH v2 5/6] btrfs: add state information for scrub Arne Jansen
2011-03-11 16:53   ` David Sterba
2011-03-12 13:13     ` Arne Jansen
2011-03-11 14:49 ` [PATCH v2 6/6] btrfs: new ioctls " Arne Jansen
2011-03-11 16:17 ` [PATCH v2 0/6] btrfs: scrub Ric Wheeler
2011-03-12 13:20   ` Arne Jansen
