* [PATCH v2 00/19]
@ 2016-12-26  6:29 Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 01/19] btrfs-progs: raid56: Introduce raid56 header for later recovery usage Qu Wenruo
                   ` (20 more replies)
  0 siblings, 21 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

For anyone who wants to try it, it can be fetched from my repo:
https://github.com/adam900710/btrfs-progs/tree/offline_scrub

Currently I have only tested it on SINGLE/DUP/RAID1/RAID5 filesystems,
with mirror, parity or data corrupted.
In all cases the tool is able to detect the corruption and give a
recoverability report.

Several reports of kernel scrub screwing up good data stripes have been
on the ML for some time.

And since kernel scrub doesn't account for P/Q corruption, it is quite
hard for us to detect errors like the kernel screwing up P/Q when
scrubbing.

To have something to compare kernel scrub against, we need a user-space
tool to act as a baseline and expose the differences in behavior.

So here is the patchset for user-space scrub.

It can do the following:

1) All mirror/backup check for non-parity based profiles
   Which means for RAID1/DUP/RAID10, we really check all mirrors, not
   just the first good mirror.

   The current "--check-data-csum" option will eventually be replaced by
   scrub, as it doesn't really check all mirrors: once it hits a good
   copy, the remaining copies are simply ignored.

2) Comprehensive RAID5/6 full stripe check
   It makes full use of btrfs csums (both tree and data), and will only
   accept a recovered full stripe if all the recovered data matches its
   csum (see the sketch below).
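
A minimal sketch of that policy (rebuild first, then only accept the
result if every rebuilt sector still passes its csum), written against
the scrub_full_stripe structure added in patch 9; recover_full_stripe()
and data_csum_matches() are illustrative helper names, not functions
from this series:

static int check_recovered_full_stripe(struct scrub_full_stripe *fstripe,
				       int bad1, int bad2)
{
	int i;
	int ret;

	/* Rebuild the corrupted stripes from P (RAID5) or P/Q (RAID6) */
	ret = recover_full_stripe(fstripe, bad1, bad2);
	if (ret)
		return ret;

	/* Re-verify the rebuilt stripes; the helper is assumed to skip P/Q */
	for (i = 0; i < fstripe->nr_stripes; i++) {
		if (!data_csum_matches(fstripe, i))
			return -EIO;	/* still mismatching, unrecoverable */
	}
	return 0;
}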

In fact it has already exposed several new btrfs kernel bugs, as it's
the main tool I'm using while developing the kernel fixes.

For example, after a data stripe was corrupted, the kernel repaired it
using parity, but the recovered full stripe ended up with wrong parity,
and a second scrub run was needed to fix it.

This patchset also introduces a new map_block() function, which is more
flexible than the current btrfs_map_block(), and provides a unified
interface for all profiles, not just an array of physical addresses.

Check the 6th and 7th patch for details.

It is already used by the RAID5/6 scrub, but can be used for other
profiles as well.

The to-do list has been shortened, since RAID6 support and the new
check logic are now included.
1) Repair support
   The current tool can already report recoverability, so repair is
   not hard to implement.

2) Test cases
   Need to make the infrastructure able to handle multi-device first.

3) Make btrfsck able to handle RAID5 with a missing device
   Currently it doesn't even open a RAID5 btrfs with a missing device,
   even though scrub should be able to handle it.

Changelog:
V0.8 RFC:
   Initial RFC patchset

v1:
   First formal patchset.
   RAID6 recovery support added, mainly copied from the kernel raid6 lib.
   Cleaner recovery logic.

v2:
   More comments in both code and commit messages, suggested by David.
   File re-arrangement: no check/ dir, raid56.[ch] moved to kernel-lib,
   also suggested by David.

Qu Wenruo (19):
  btrfs-progs: raid56: Introduce raid56 header for later recovery usage
  btrfs-progs: raid56: Introduce tables for RAID6 recovery
  btrfs-progs: raid56: Allow raid6 to recover 2 data stripes
  btrfs-progs: raid56: Allow raid6 to recover data and p
  btrfs-progs: Introduce wrapper to recover raid56 data
  btrfs-progs: Introduce new btrfs_map_block function which returns more
    unified result.
  btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
  btrfs-progs: csum: Introduce function to read out one data csum
  btrfs-progs: scrub: Introduce structures to support fsck scrub for
    RAID56
  btrfs-progs: scrub: Introduce function to scrub mirror based tree
    block
  btrfs-progs: scrub: Introduce function to scrub mirror based data
    blocks
  btrfs-progs: scrub: Introduce function to scrub one extent
  btrfs-progs: scrub: Introduce function to scrub one data stripe
  btrfs-progs: scrub: Introduce function to verify parities
  btrfs-progs: extent-tree: Introduce function to check if there is any
    extent in given range.
  btrfs-progs: scrub: Introduce function to recover data parity
  btrfs-progs: scrub: Introduce a function to scrub one full stripe
  btrfs-progs: scrub: Introduce function to check a whole block group
  btrfs-progs: fsck: Introduce offline scrub function

 .gitignore                         |    2 +
 Documentation/btrfs-check.asciidoc |    7 +
 Makefile.in                        |   19 +-
 cmds-check.c                       |   12 +-
 csum.c                             |   96 ++++
 ctree.h                            |    8 +
 disk-io.c                          |    4 +-
 disk-io.h                          |    7 +-
 extent-tree.c                      |   60 +++
 kernel-lib/mktables.c              |  148 ++++++
 kernel-lib/raid56.c                |  359 +++++++++++++
 kernel-lib/raid56.h                |   58 +++
 raid56.c                           |  172 ------
 scrub.c                            | 1004 ++++++++++++++++++++++++++++++++++++
 volumes.c                          |  283 ++++++++++
 volumes.h                          |   49 ++
 16 files changed, 2103 insertions(+), 185 deletions(-)
 create mode 100644 csum.c
 create mode 100644 kernel-lib/mktables.c
 create mode 100644 kernel-lib/raid56.c
 create mode 100644 kernel-lib/raid56.h
 delete mode 100644 raid56.c
 create mode 100644 scrub.c

-- 
2.11.0





* [PATCH v2 01/19] btrfs-progs: raid56: Introduce raid56 header for later recovery usage
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 02/19] btrfs-progs: raid56: Introduce tables for RAID6 recovery Qu Wenruo
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new header, kernel-lib/raid56.h, for later raid56 work.

It contains 2 functions, from original btrfs-progs code:
void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);

It will be expanded later, and some parts of it (the RAID6 recovery
part) may be kept in sync with the kernel.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 Makefile.in         |  2 +-
 disk-io.h           |  5 -----
 kernel-lib/raid56.h | 28 ++++++++++++++++++++++++++++
 volumes.c           |  1 +
 4 files changed, 30 insertions(+), 6 deletions(-)
 create mode 100644 kernel-lib/raid56.h

diff --git a/Makefile.in b/Makefile.in
index 0e3a0a0f..6e009bff 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -110,7 +110,7 @@ libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \
 		   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h kernel-lib/rbtree.h btrfs-list.h \
 	       kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
-	       kernel-lib/radix-tree.h extent-cache.h \
+	       kernel-lib/radix-tree.h kernel-lib/raid56.h extent-cache.h \
 	       extent_io.h ioctl.h ctree.h btrfsck.h version.h
 TESTS = fsck-tests.sh convert-tests.sh
 
diff --git a/disk-io.h b/disk-io.h
index 1c8387e7..4de9fef7 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -196,9 +196,4 @@ int write_tree_block(struct btrfs_trans_handle *trans,
 		     struct extent_buffer *eb);
 int write_and_map_eb(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		     struct extent_buffer *eb);
-
-/* raid56.c */
-void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
-int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
-
 #endif
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
new file mode 100644
index 00000000..7d4f4678
--- /dev/null
+++ b/kernel-lib/raid56.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef _BTRFS_PROGS_RAID56_H
+#define _BTRFS_PROGS_RAID56_H
+/*
+ * Headers for RAID5/6 operations.
+ * Original headers from original RAID5/6 codes, not from kernel header.
+ */
+
+void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
+int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
+#endif
diff --git a/volumes.c b/volumes.c
index a0a85edd..f17bdeed 100644
--- a/volumes.c
+++ b/volumes.c
@@ -28,6 +28,7 @@
 #include "print-tree.h"
 #include "volumes.h"
 #include "utils.h"
+#include "kernel-lib/raid56.h"
 
 struct stripe {
 	struct btrfs_device *dev;
-- 
2.11.0





* [PATCH v2 02/19] btrfs-progs: raid56: Introduce tables for RAID6 recovery
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 01/19] btrfs-progs: raid56: Introduce raid56 header for later recovery usage Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 03/19] btrfs-progs: raid56: Allow raid6 to recover 2 data stripes Qu Wenruo
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Use the kernel RAID6 Galois field tables for later RAID6 recovery.

The Galois tables file, kernel-lib/tables.c, is generated by a user
space program, mktables.

The Galois field table declarations, in kernel-lib/raid56.h, are copied
verbatim from the kernel.

mktables.c itself is copied from the kernel with minor header/macro
modifications, to ensure the generated tables.c works well in
btrfs-progs.
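
For reference, the arithmetic behind these tables is multiplication in
GF(2^8) with the same generator polynomial (0x11d) the kernel RAID6
code uses.  The following self-contained check (not part of the patch,
shown only to illustrate what the generated tables contain) verifies
the raid6_gfinv construction, i.e. that a * a^254 == 1 for every
non-zero a:

#include <stdio.h>
#include <stdint.h>

/* Same GF(2^8) multiply as in kernel-lib/mktables.c */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t v = 0;

	while (b) {
		if (b & 1)
			v ^= a;
		a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
		b >>= 1;
	}
	return v;
}

int main(void)
{
	int a, i;

	for (a = 1; a < 256; a++) {
		uint8_t inv = 1;

		for (i = 0; i < 254; i++)	/* inv = a^254 = a^-1 */
			inv = gfmul(inv, a);
		if (gfmul(a, inv) != 1) {
			printf("broken inverse for 0x%02x\n", a);
			return 1;
		}
	}
	printf("all 255 inverses check out\n");
	return 0;
}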

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 .gitignore            |   2 +
 Makefile.in           |  15 ++++-
 kernel-lib/mktables.c | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel-lib/raid56.h   |  12 ++++
 4 files changed, 174 insertions(+), 3 deletions(-)
 create mode 100644 kernel-lib/mktables.c

diff --git a/.gitignore b/.gitignore
index 98b3657b..554e8921 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,6 +35,8 @@ btrfs-select-super
 btrfs-calc-size
 btrfs-crc
 btrfstune
+mktables
+kernel-lib/tables.c
 libbtrfs.a
 libbtrfs.so
 libbtrfs.so.0
diff --git a/Makefile.in b/Makefile.in
index 6e009bff..fecaaa6a 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -98,7 +98,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
 	  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
-	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o
+	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
+	  kernel-lib/tables.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
@@ -318,6 +319,14 @@ version.h: version.sh version.h.in configure.ac
 	@echo "    [SH]     $@"
 	$(Q)bash ./config.status --silent $@
 
+mktables: kernel-lib/mktables.c
+	@echo "    [CC]     $@"
+	$(Q)$(CC) $(CFLAGS) $< -o $@
+
+kernel-lib/tables.c: mktables
+	@echo "    [TABLE]  $@"
+	$(Q)./mktables > $@ || ($(RM) -f $@ && exit 1)
+	
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
 	@echo "    [LD]     $@"
 	$(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(LIBBTRFS_LIBS) \
@@ -490,12 +499,12 @@ clean-all: clean clean-doc clean-gen
 clean: $(CLEANDIRS)
 	@echo "Cleaning"
 	$(Q)$(RM) -f -- $(progs) cscope.out *.o *.o.d \
-		kernel-lib/*.o kernel-lib/*.o.d \
+		kernel-lib/*.o kernel-lib/*.o.d kernel-lib/tables.c \
 		image/*.o image/*.o.d \
 		convert/*.o convert/*.o.d \
 		mkfs/*.o mkfs/*.o.d \
 	      dir-test ioctl-test quick-test library-test library-test-static \
-	      btrfs.static mkfs.btrfs.static \
+	      mktables btrfs.static mkfs.btrfs.static \
 	      $(check_defs) \
 	      $(libs) $(lib_links) \
 	      $(progs_static) $(progs_extra)
diff --git a/kernel-lib/mktables.c b/kernel-lib/mktables.c
new file mode 100644
index 00000000..85f621fe
--- /dev/null
+++ b/kernel-lib/mktables.c
@@ -0,0 +1,148 @@
+/* -*- linux-c -*- ------------------------------------------------------- *
+ *
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
+ *
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+/*
+ * mktables.c
+ *
+ * Make RAID-6 tables.  This is a host user space program to be run at
+ * compile time.
+ */
+
+/*
+ * Btrfs-progs port, with following minor fixes:
+ * 1) Use "kerncompat.h"
+ * 2) Get rid of __KERNEL__ related macros
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+#include <stdlib.h>
+#include <time.h>
+
+static uint8_t gfmul(uint8_t a, uint8_t b)
+{
+	uint8_t v = 0;
+
+	while (b) {
+		if (b & 1)
+			v ^= a;
+		a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
+		b >>= 1;
+	}
+
+	return v;
+}
+
+static uint8_t gfpow(uint8_t a, int b)
+{
+	uint8_t v = 1;
+
+	b %= 255;
+	if (b < 0)
+		b += 255;
+
+	while (b) {
+		if (b & 1)
+			v = gfmul(v, a);
+		a = gfmul(a, a);
+		b >>= 1;
+	}
+
+	return v;
+}
+
+int main(int argc, char *argv[])
+{
+	int i, j, k;
+	uint8_t v;
+	uint8_t exptbl[256], invtbl[256];
+
+	printf("#include \"kerncompat.h\"\n");
+
+	/* Compute multiplication table */
+	printf("\nconst u8  __attribute__((aligned(256)))\n"
+		"raid6_gfmul[256][256] =\n"
+		"{\n");
+	for (i = 0; i < 256; i++) {
+		printf("\t{\n");
+		for (j = 0; j < 256; j += 8) {
+			printf("\t\t");
+			for (k = 0; k < 8; k++)
+				printf("0x%02x,%c", gfmul(i, j + k),
+				       (k == 7) ? '\n' : ' ');
+		}
+		printf("\t},\n");
+	}
+	printf("};\n");
+
+	/* Compute vector multiplication table */
+	printf("\nconst u8  __attribute__((aligned(256)))\n"
+		"raid6_vgfmul[256][32] =\n"
+		"{\n");
+	for (i = 0; i < 256; i++) {
+		printf("\t{\n");
+		for (j = 0; j < 16; j += 8) {
+			printf("\t\t");
+			for (k = 0; k < 8; k++)
+				printf("0x%02x,%c", gfmul(i, j + k),
+				       (k == 7) ? '\n' : ' ');
+		}
+		for (j = 0; j < 16; j += 8) {
+			printf("\t\t");
+			for (k = 0; k < 8; k++)
+				printf("0x%02x,%c", gfmul(i, (j + k) << 4),
+				       (k == 7) ? '\n' : ' ');
+		}
+		printf("\t},\n");
+	}
+	printf("};\n");
+
+	/* Compute power-of-2 table (exponent) */
+	v = 1;
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gfexp[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			exptbl[i + j] = v;
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+			v = gfmul(v, 2);
+			if (v == 1)
+				v = 0;	/* For entry 255, not a real entry */
+		}
+	}
+	printf("};\n");
+
+	/* Compute inverse table x^-1 == x^254 */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gfinv[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			invtbl[i + j] = v = gfpow(i + j, 254);
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+
+	/* Compute inv(2^x + 1) (exponent-xor-inverse) table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gfexi[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++)
+			printf("0x%02x,%c", invtbl[exptbl[i + j] ^ 1],
+			       (j == 7) ? '\n' : ' ');
+	}
+	printf("};\n");
+
+	return 0;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index 7d4f4678..1bf2e01a 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -25,4 +25,16 @@
 
 void raid6_gen_syndrome(int disks, size_t bytes, void **ptrs);
 int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data);
+
+/*
+ * Headers synchronized from kernel include/linux/raid/pq.h
+ * No modification at all.
+ *
+ * Galois field tables.
+ */
+extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
+extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
+extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 #endif
-- 
2.11.0





* [PATCH v2 03/19] btrfs-progs: raid56: Allow raid6 to recover 2 data stripes
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 01/19] btrfs-progs: raid56: Introduce raid56 header for later recovery usage Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 02/19] btrfs-progs: raid56: Introduce tables for RAID6 recovery Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 04/19] btrfs-progs: raid56: Allow raid6 to recover data and p Qu Wenruo
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Copied from the kernel lib/raid6/recov.c raid6_2data_recov_intx1()
function, with the following modifications:
- Renamed to raid6_recov_data2() for a shorter name
- s/kfree/free/g modification
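
For reference, the identities the copied function relies on (restated
from the standard kernel RAID6 derivation, not new code): with x < y
the indexes of the two lost data stripes, g the GF(2^8) generator, and
P', Q' the parities recomputed with the lost stripes replaced by the
zero_mem pages, all arithmetic in GF(2^8) (\oplus is XOR):

  \begin{aligned}
  P \oplus P' &= D_x \oplus D_y \\
  Q \oplus Q' &= g^{x} D_x \oplus g^{y} D_y \\
  D_y &= (P \oplus P')\,(g^{y-x} \oplus 1)^{-1} \oplus
         (Q \oplus Q')\,(g^{x} \oplus g^{y})^{-1} \\
  D_x &= (P \oplus P') \oplus D_y
  \end{aligned}

In the code, pbmul is the multiply-by-(g^{y-x} \oplus 1)^{-1} table row
(raid6_gfexi[dest2 - dest1]) and qmul is the
multiply-by-(g^{x} \oplus g^{y})^{-1} row
(raid6_gfinv[raid6_gfexp[dest1] ^ raid6_gfexp[dest2]]).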

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 Makefile.in                     |  4 +--
 raid56.c => kernel-lib/raid56.c | 69 +++++++++++++++++++++++++++++++++++++++++
 kernel-lib/raid56.h             |  5 +++
 3 files changed, 76 insertions(+), 2 deletions(-)
 rename raid56.c => kernel-lib/raid56.c (71%)

diff --git a/Makefile.in b/Makefile.in
index fecaaa6a..c3f0eeda 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -96,10 +96,10 @@ CHECKER_FLAGS := -include $(check_defs) -D__CHECKER__ \
 objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
-	  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
+	  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
-	  kernel-lib/tables.o
+	  kernel-lib/tables.o kernel-lib/raid56.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/raid56.c b/kernel-lib/raid56.c
similarity index 71%
rename from raid56.c
rename to kernel-lib/raid56.c
index 8c79c456..dca8f8d4 100644
--- a/raid56.c
+++ b/kernel-lib/raid56.c
@@ -28,6 +28,7 @@
 #include "disk-io.h"
 #include "volumes.h"
 #include "utils.h"
+#include "kernel-lib/raid56.h"
 
 /*
  * This is the C data type to use
@@ -170,3 +171,71 @@ int raid5_gen_result(int nr_devs, size_t stripe_len, int dest, void **data)
 	}
 	return 0;
 }
+
+/*
+ * Raid 6 recovery code copied from kernel lib/raid6/recov.c.
+ * With modifications:
+ * - rename from raid6_2data_recov_intx1
+ * - kfree/free modification for btrfs-progs
+ */
+int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
+		      void **data)
+{
+	u8 *p, *q, *dp, *dq;
+	u8 px, qx, db;
+	const u8 *pbmul;	/* P multiplier table for B data */
+	const u8 *qmul;		/* Q multiplier table (for both) */
+	char *zero_mem1, *zero_mem2;
+	int ret = 0;
+
+	/* Early check */
+	if (dest1 < 0 || dest1 >= nr_devs - 2 ||
+	    dest2 < 0 || dest2 >= nr_devs - 2 || dest1 >= dest2)
+		return -EINVAL;
+
+	zero_mem1 = calloc(1, stripe_len);
+	zero_mem2 = calloc(1, stripe_len);
+	if (!zero_mem1 || !zero_mem2) {
+		free(zero_mem1);
+		free(zero_mem2);
+		return -ENOMEM;
+	}
+
+	p = (u8 *)data[nr_devs - 2];
+	q = (u8 *)data[nr_devs - 1];
+
+	/* Compute syndrome with zero for the missing data pages
+	   Use the dead data pages as temporary storage for
+	   delta p and delta q */
+	dp = (u8 *)data[dest1];
+	data[dest1] = (void *)zero_mem1;
+	data[nr_devs - 2] = dp;
+	dq = (u8 *)data[dest2];
+	data[dest2] = (void *)zero_mem2;
+	data[nr_devs - 1] = dq;
+
+	raid6_gen_syndrome(nr_devs, stripe_len, data);
+
+	/* Restore pointer table */
+	data[dest1]   = dp;
+	data[dest2]   = dq;
+	data[nr_devs - 2] = p;
+	data[nr_devs - 1] = q;
+
+	/* Now, pick the proper data tables */
+	pbmul = raid6_gfmul[raid6_gfexi[dest2 - dest1]];
+	qmul  = raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]^raid6_gfexp[dest2]]];
+
+	/* Now do it... */
+	while ( stripe_len-- ) {
+		px    = *p ^ *dp;
+		qx    = qmul[*q ^ *dq];
+		*dq++ = db = pbmul[px] ^ qx; /* Reconstructed B */
+		*dp++ = db ^ px; /* Reconstructed A */
+		p++; q++;
+	}
+
+	free(zero_mem1);
+	free(zero_mem2);
+	return ret;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index 1bf2e01a..d397a23e 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -37,4 +37,9 @@ extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
+
+
+/* Recover raid6 with 2 data corrupted */
+int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
+		      void **data);
 #endif
-- 
2.11.0





* [PATCH v2 04/19] btrfs-progs: raid56: Allow raid6 to recover data and p
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (2 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 03/19] btrfs-progs: raid56: Allow raid6 to recover 2 data stripes Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 05/19] btrfs-progs: Introduce wrapper to recover raid56 data Qu Wenruo
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Copied from kernel lib/raid6/recov.c.

Minor modifications include:
- Renamed from raid6_datap_recov_intx1() to raid6_recov_datap()
- Renamed the parameter from faila to dest1
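
For reference, the recovery here is simpler (again just restating the
standard kernel derivation): with data stripe x and P lost and Q still
good, Q' is recomputed with the lost data stripe zeroed, then

  D_x = g^{-x}\,(Q \oplus Q'), \qquad P = P' \oplus D_x

which is what the qmul = raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]]]
table row and the final "*p++ ^= *dq" loop compute.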

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 kernel-lib/raid56.c | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel-lib/raid56.h |  2 ++
 2 files changed, 43 insertions(+)

diff --git a/kernel-lib/raid56.c b/kernel-lib/raid56.c
index dca8f8d4..e078972b 100644
--- a/kernel-lib/raid56.c
+++ b/kernel-lib/raid56.c
@@ -239,3 +239,44 @@ int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
 	free(zero_mem2);
 	return ret;
 }
+
+/*
+ * Raid 6 recover code copied from kernel lib/raid6/recov.c
+ * - rename from raid6_datap_recov_intx1()
+ * - parameter changed from faila to dest1
+ */
+int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data)
+{
+	u8 *p, *q, *dq;
+	const u8 *qmul;		/* Q multiplier table */
+	char *zero_mem;
+
+	p = (u8 *)data[nr_devs - 2];
+	q = (u8 *)data[nr_devs - 1];
+
+	zero_mem = calloc(1, stripe_len);
+	if (!zero_mem)
+		return -ENOMEM;
+
+	/* Compute syndrome with zero for the missing data page
+	   Use the dead data page as temporary storage for delta q */
+	dq = (u8 *)data[dest1];
+	data[dest1] = (void *)zero_mem;
+	data[nr_devs - 1] = dq;
+
+	raid6_gen_syndrome(nr_devs, stripe_len, data);
+
+	/* Restore pointer table */
+	data[dest1]   = dq;
+	data[nr_devs - 1] = q;
+
+	/* Now, pick the proper data tables */
+	qmul  = raid6_gfmul[raid6_gfinv[raid6_gfexp[dest1]]];
+
+	/* Now do it... */
+	while ( stripe_len-- ) {
+		*p++ ^= *dq = qmul[*q ^ *dq];
+		q++; dq++;
+	}
+	return 0;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index d397a23e..e088279b 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -42,4 +42,6 @@ extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 /* Recover raid6 with 2 data corrupted */
 int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
 		      void **data);
+/* Recover data and P */
+int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data);
 #endif
-- 
2.11.0





* [PATCH v2 05/19] btrfs-progs: Introduce wrapper to recover raid56 data
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (3 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 04/19] btrfs-progs: raid56: Allow raid6 to recover data and p Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result Qu Wenruo
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a wrapper to recover raid56 data.

The logic is the same as the kernel one, but with a different interface,
since the kernel implementation cares about performance while in
btrfs-progs we don't care that much.

The interface is also more caller-friendly inside btrfs-progs.
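
A hedged usage sketch (stripe_len, the data buffers and error handling
are assumed to be set up by the caller; only the raid56_recov() call
and the BTRFS_BLOCK_GROUP_RAID6 flag come from existing code, the rest
is illustrative):

	/*
	 * Rebuild one corrupted data stripe of a 4-device RAID6 full
	 * stripe that has already been read into memory.
	 * data[0..1] are data stripes, data[2] is P, data[3] is Q.
	 */
	void *data[4];
	int bad = 1;	/* index of the stripe that failed its csum */
	int ret;

	ret = raid56_recov(4, stripe_len, BTRFS_BLOCK_GROUP_RAID6,
			   bad, -1, data);	/* -1: no 2nd corruption */
	if (ret > 0)
		/* more corrupted stripes than parity can repair */;
	else if (ret == 0)
		/* data[1] now holds the rebuilt stripe, verify its csum */;
	else
		/* fatal error, e.g. -ENOMEM */;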

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 kernel-lib/raid56.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel-lib/raid56.h | 11 ++++++++
 2 files changed, 88 insertions(+)

diff --git a/kernel-lib/raid56.c b/kernel-lib/raid56.c
index e078972b..e3a9339e 100644
--- a/kernel-lib/raid56.c
+++ b/kernel-lib/raid56.c
@@ -280,3 +280,80 @@ int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data)
 	}
 	return 0;
 }
+
+/* Original raid56 recovery wrapper */
+int raid56_recov(int nr_devs, size_t stripe_len, u64 profile, int dest1,
+		 int dest2, void **data)
+{
+	int min_devs;
+	int ret;
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID5)
+		min_devs = 2;
+	else if (profile & BTRFS_BLOCK_GROUP_RAID6)
+		min_devs = 3;
+	else
+		return -EINVAL;
+	if (nr_devs < min_devs)
+		return -EINVAL;
+
+	/* Nothing to recover */
+	if (dest1 == -1 && dest2 == -1)
+		return 0;
+
+	/* Reorder dest1/2, so only dest2 can be -1  */
+	if (dest1 == -1) {
+		dest1 = dest2;
+		dest2 = -1;
+	} else if (dest2 != -1 && dest1 != -1) {
+		/* Reorder dest1/2, ensure dest2 > dest1 */
+		if (dest1 > dest2) {
+			int tmp;
+
+			tmp = dest2;
+			dest2 = dest1;
+			dest1 = tmp;
+		}
+	}
+
+	if (profile & BTRFS_BLOCK_GROUP_RAID5) {
+		if (dest2 != -1)
+			return 1;
+		return raid5_gen_result(nr_devs, stripe_len, dest1, data);
+	}
+
+	/* RAID6 one dev corrupted case*/
+	if (dest2 == -1) {
+		/* Regenerate P/Q */
+		if (dest1 == nr_devs - 1 || dest1 == nr_devs - 2) {
+			raid6_gen_syndrome(nr_devs, stripe_len, data);
+			return 0;
+		}
+
+		/* Regenerate data from P */
+		return raid5_gen_result(nr_devs - 1, stripe_len, dest1, data);
+	}
+
+	/* P/Q both corrupted */
+	if (dest1 == nr_devs - 2 && dest2 == nr_devs - 1) {
+		raid6_gen_syndrome(nr_devs, stripe_len, data);
+		return 0;
+	}
+
+	/* 2 Data corrupted */
+	if (dest2 < nr_devs - 2)
+		return raid6_recov_data2(nr_devs, stripe_len, dest1, dest2,
+					 data);
+	/* Data and P*/
+	if (dest2 == nr_devs - 1)
+		return raid6_recov_datap(nr_devs, stripe_len, dest1, data);
+
+	/*
+	 * Final case, Data and Q, recover data first then regenerate Q
+	 */
+	ret = raid5_gen_result(nr_devs - 1, stripe_len, dest1, data);
+	if (ret < 0)
+		return ret;
+	raid6_gen_syndrome(nr_devs, stripe_len, data);
+	return 0;
+}
diff --git a/kernel-lib/raid56.h b/kernel-lib/raid56.h
index e088279b..9aee39aa 100644
--- a/kernel-lib/raid56.h
+++ b/kernel-lib/raid56.h
@@ -44,4 +44,15 @@ int raid6_recov_data2(int nr_devs, size_t stripe_len, int dest1, int dest2,
 		      void **data);
 /* Recover data and P */
 int raid6_recov_datap(int nr_devs, size_t stripe_len, int dest1, void **data);
+
+/*
+ * Recover raid56 data
+ * @dest1/2 can be -1 to indicate correct data
+ *
+ * Return >0 for unrecoverable case.
+ * Return 0 for recoverable case, and recovered data will be stored into @data
+ * Return <0 for fatal error
+ */
+int raid56_recov(int nr_devs, size_t stripe_len, u64 profile, int dest1,
+		 int dest2, void **data);
 #endif
-- 
2.11.0





* [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (4 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 05/19] btrfs-progs: Introduce wrapper to recover raid56 data Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2017-02-24  0:37   ` Liu Bo
  2016-12-26  6:29 ` [PATCH v2 07/19] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes Qu Wenruo
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, __btrfs_map_block_v2().

Unlike the old btrfs_map_block(), which needs different parameters to
handle different RAID profiles, this new function uses a unified
btrfs_map_block structure to handle all RAID profiles in a more
meaningful way:

It returns the physical address along with the logical address for
each stripe.

For RAID1/Single/DUP (non-striped), the result would be like:
Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2

The result will be as long as possible, since it's not striped at all.

For RAID0/10 (striped without parity), the result will be aligned to
the full stripe size:
Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4

For RAID5/6 (striped with parity and device rotation), the result will
be aligned to the full stripe size:
Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4

The new unified layout should be very flexible and can even handle
things like N-way RAID1 (which the old mirror_num based interface can't
handle well).
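
A hedged caller sketch of the new interface (fs_info, logical and
length are assumed to come from the caller; the printing is purely
illustrative):

	struct btrfs_map_block *map = NULL;
	int ret;
	int i;

	ret = __btrfs_map_block_v2(fs_info, READ, logical, length, &map);
	if (ret < 0)
		return ret;

	/* One entry per stripe, same layout for every profile */
	for (i = 0; i < map->num_stripes; i++) {
		struct btrfs_map_stripe *stripe = &map->stripes[i];

		printf("stripe %d: logical %llu physical %llu len %llu devid %llu\n",
		       i, stripe->logical, stripe->physical, stripe->length,
		       stripe->dev->devid);
	}
	free(map);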

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 volumes.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 volumes.h |  49 +++++++++++++++++
 2 files changed, 230 insertions(+)

diff --git a/volumes.c b/volumes.c
index f17bdeed..11d1f0e8 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1593,6 +1593,187 @@ out:
 	return 0;
 }
 
+static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
+{
+	struct btrfs_map_block *ret;
+	int size;
+
+	size = sizeof(struct btrfs_map_stripe) * num_stripes +
+		sizeof(struct btrfs_map_block);
+	ret = malloc(size);
+	if (!ret)
+		return NULL;
+	memset(ret, 0, size);
+	return ret;
+}
+
+static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
+			       struct btrfs_map_block *map_block)
+{
+	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	u64 bg_start = map->ce.start;
+	u64 bg_end = bg_start + map->ce.size;
+	u64 bg_offset = start - bg_start; /* offset inside the block group */
+	u64 fstripe_logical = 0;	/* Full stripe start logical bytenr */
+	u64 fstripe_size = 0;		/* Full stripe logical size */
+	u64 fstripe_phy_off = 0;	/* Full stripe offset in each dev */
+	u32 stripe_len = map->stripe_len;
+	int sub_stripes = map->sub_stripes;
+	int data_stripes = nr_data_stripes(map);
+	int dev_rotation;
+	int i;
+
+	map_block->num_stripes = map->num_stripes;
+	map_block->type = profile;
+
+	/*
+	 * Common full stripe data for stripe based profiles
+	 */
+	if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
+		       BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+		fstripe_size = stripe_len * data_stripes;
+		if (sub_stripes)
+			fstripe_size /= sub_stripes;
+		fstripe_logical = bg_offset / fstripe_size * fstripe_size +
+				    bg_start;
+		fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
+	}
+
+	switch (profile) {
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+	case 0: /* SINGLE */
+		/*
+		 * Non-striped mode (SINGLE, DUP and RAID1)
+		 * Just use offset to fill map_block
+		 */
+		map_block->stripe_len = 0;
+		map_block->start = start;
+		map_block->length = min(bg_end, start + length) - start;
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_map_stripe *stripe;
+
+			stripe = &map_block->stripes[i];
+
+			stripe->dev = map->stripes[i].dev;
+			stripe->logical = start;
+			stripe->physical = map->stripes[i].physical + bg_offset;
+			stripe->length = map_block->length;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID0:
+		/*
+		 * Stripe modes without parity(0 and 10)
+		 * Return the whole full stripe
+		 */
+
+		map_block->start = fstripe_logical;
+		map_block->length = fstripe_size;
+		map_block->stripe_len = map->stripe_len;
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_map_stripe *stripe;
+			u64 cur_offset;
+
+			/* Handle RAID10 sub stripes */
+			if (sub_stripes)
+				cur_offset = i / sub_stripes * stripe_len;
+			else
+				cur_offset = stripe_len * i;
+			stripe = &map_block->stripes[i];
+
+			stripe->dev = map->stripes[i].dev;
+			stripe->logical = fstripe_logical + cur_offset;
+			stripe->length = stripe_len;
+			stripe->physical = map->stripes[i].physical +
+					   fstripe_phy_off;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/*
+		 * Stripe modes with parity and device rotation(5 and 6)
+		 *
+		 * Return the whole full stripe
+		 */
+
+		dev_rotation = (bg_offset / fstripe_size) % map->num_stripes;
+
+		map_block->start = fstripe_logical;
+		map_block->length = fstripe_size;
+		map_block->stripe_len = map->stripe_len;
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_map_stripe *stripe;
+			int dest_index;
+			u64 cur_offset = stripe_len * i;
+
+			stripe = &map_block->stripes[i];
+
+			dest_index = (i + dev_rotation) % map->num_stripes;
+			stripe->dev = map->stripes[dest_index].dev;
+			stripe->length = stripe_len;
+			stripe->physical = map->stripes[dest_index].physical +
+					   fstripe_phy_off;
+			if (i < data_stripes) {
+				/* data stripe */
+				stripe->logical = fstripe_logical +
+						  cur_offset;
+			} else if (i == data_stripes) {
+				/* P */
+				stripe->logical = BTRFS_RAID5_P_STRIPE;
+			} else {
+				/* Q */
+				stripe->logical = BTRFS_RAID6_Q_STRIPE;
+			}
+		}
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
+			 u64 length, struct btrfs_map_block **map_ret)
+{
+	struct cache_extent *ce;
+	struct map_lookup *map;
+	struct btrfs_map_block *map_block;
+	int ret;
+
+	/* Early parameter check */
+	if (!length || !map_ret) {
+		error("wrong parameter for %s", __func__);
+		return -EINVAL;
+	}
+
+	ce = search_cache_extent(&fs_info->mapping_tree.cache_tree, logical);
+	if (!ce)
+		return -ENOENT;
+	if (ce->start > logical)
+		return -ENOENT;
+
+	map = container_of(ce, struct map_lookup, ce);
+	/*
+	 * Allocate a full map_block anyway
+	 *
+	 * For write, we need the full map_block anyway.
+	 * For read, it will be trimmed to the needed stripes before returning.
+	 */
+	map_block = alloc_map_block(map->num_stripes);
+	if (!map_block)
+		return -ENOMEM;
+	ret = fill_full_map_block(map, logical, length, map_block);
+	if (ret < 0) {
+		free(map_block);
+		return ret;
+	}
+	/* TODO: Remove unrelated map_stripes for READ operation */
+
+	*map_ret = map_block;
+	return 0;
+}
+
 struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
 				       u8 *uuid, u8 *fsid)
 {
diff --git a/volumes.h b/volumes.h
index ee7d56ab..0a575557 100644
--- a/volumes.h
+++ b/volumes.h
@@ -108,6 +108,51 @@ struct map_lookup {
 	struct btrfs_bio_stripe stripes[];
 };
 
+struct btrfs_map_stripe {
+	struct btrfs_device *dev;
+
+	/*
+	 * Logical address of the stripe start.
+	 * Caller should check if this logical is the desired map start.
+	 * It's possible that the logical is smaller or larger than desired
+	 * map range.
+	 *
+	 * For P/Q stripe, it will be BTRFS_RAID5_P_STRIPE
+	 * and BTRFS_RAID6_Q_STRIPE.
+	 */
+	u64 logical;
+
+	u64 physical;
+
+	/* The length of the stripe */
+	u64 length;
+};
+
+struct btrfs_map_block {
+	/*
+	 * The logical start of the whole map block.
+	 * For RAID5/6 it will be the bytenr of the full stripe start,
+	 * so it's possible that @start is smaller than desired map range
+	 * start.
+	 */
+	u64 start;
+
+	/*
+	 * The logical length of the map block.
+	 * For RAID5/6 it will be total data stripe size
+	 */
+	u64 length;
+
+	/* Block group type */
+	u64 type;
+
+	/* Stripe length, for non-stripped mode, it will be 0 */
+	u32 stripe_len;
+
+	int num_stripes;
+	struct btrfs_map_stripe stripes[];
+};
+
 #define btrfs_multi_bio_size(n) (sizeof(struct btrfs_multi_bio) + \
 			    (sizeof(struct btrfs_bio_stripe) * (n)))
 #define btrfs_map_lookup_size(n) (sizeof(struct map_lookup) + \
@@ -187,6 +232,10 @@ int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
 		    u64 logical, u64 *length,
 		    struct btrfs_multi_bio **multi_ret, int mirror_num,
 		    u64 **raid_map_ret);
+
+/* TODO: Use this map_block_v2 to replace __btrfs_map_block() */
+int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
+			 u64 length, struct btrfs_map_block **map_ret);
 int btrfs_next_bg(struct btrfs_mapping_tree *map_tree, u64 *logical,
 		     u64 *size, u64 type);
 static inline int btrfs_next_bg_metadata(struct btrfs_mapping_tree *map_tree,
-- 
2.11.0





* [PATCH v2 07/19] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (5 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 08/19] btrfs-progs: csum: Introduce function to read out one data csum Qu Wenruo
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

For READ, callers normally want to get just what they requested, rather
than the full stripe map.

In this case, we should remove the unrelated stripes, as in the
following case:
               32K               96K
               |<-request range->|
         0              64k           128K
RAID0:   |    Data 1    |   Data 2    |
              disk1         disk2
Before this patch, we return the full stripe:
Stripe 0: Logical 0, Physical X, Len 64K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 64K, Dev disk2

After this patch, we limit the stripe result to the requested range:
Stripe 0: Logical 32K, Physical X+32K, Len 32K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 32K, Dev disk2

And if it's a RAID5/6 full stripe, READ is handled just like RAID0,
ignoring the parities.

This should make the function easier for callers to use.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 volumes.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 11d1f0e8..79b877c9 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1733,6 +1733,107 @@ static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
 	return 0;
 }
 
+static void del_one_stripe(struct btrfs_map_block *map_block, int i)
+{
+	int cur_nr = map_block->num_stripes;
+	int size_left = (cur_nr - 1 - i) * sizeof(struct btrfs_map_stripe);
+
+	memmove(&map_block->stripes[i], &map_block->stripes[i + 1], size_left);
+	map_block->num_stripes--;
+}
+
+static void remove_unrelated_stripes(struct map_lookup *map,
+				     int rw, u64 start, u64 length,
+				     struct btrfs_map_block *map_block)
+{
+	int i = 0;
+	/*
+	 * RAID5/6 write must use full stripe.
+	 * No need to do anything.
+	 */
+	if (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6) &&
+	    rw == WRITE)
+		return;
+
+	/*
+	 * For RAID0/1/10/DUP, whatever read/write, we can remove unrelated
+	 * stripes without causing anything wrong.
+	 * RAID5/6 READ is just like RAID0, we don't care parity unless we need
+	 * to recovery.
+	 * For recovery, rw should be set to WRITE.
+	 */
+	while (i < map_block->num_stripes) {
+		struct btrfs_map_stripe *stripe;
+		u64 orig_logical; /* Original stripe logical start */
+		u64 orig_end; /* Original stripe logical end */
+
+		stripe = &map_block->stripes[i];
+
+		/*
+		 * For READ, we don't really care parity
+		 */
+		if (stripe->logical == BTRFS_RAID5_P_STRIPE ||
+		    stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+			del_one_stripe(map_block, i);
+			continue;
+		}
+		/* Completely unrelated stripe */
+		if (stripe->logical >= start + length ||
+		    stripe->logical + stripe->length <= start) {
+			del_one_stripe(map_block, i);
+			continue;
+		}
+		/* Covered stripe, modify its logical and physical */
+		orig_logical = stripe->logical;
+		orig_end = stripe->logical + stripe->length;
+		if (start + length <= orig_end) {
+			/*
+			 * |<--range-->|
+			 *   |  stripe   |
+			 * Or
+			 *     |<range>|
+			 *   |  stripe   |
+			 */
+			stripe->logical = max(orig_logical, start);
+			stripe->length = start + length - stripe->logical;
+			stripe->physical += stripe->logical - orig_logical;
+		} else if (start >= orig_logical) {
+			/*
+			 *     |<-range--->|
+			 * |  stripe     |
+			 * Or
+			 *     |<range>|
+			 * |  stripe     |
+			 */
+			stripe->logical = start;
+			stripe->length = min(orig_end, start + length);
+			stripe->physical += stripe->logical - orig_logical;
+		}
+		/*
+		 * Remaining case:
+		 * |<----range----->|
+		 *   | stripe |
+		 * No need to do any modification
+		 */
+		i++;
+	}
+
+	/* Recalculate map_block size */
+	map_block->start = 0;
+	map_block->length = 0;
+	for (i = 0; i < map_block->num_stripes; i++) {
+		struct btrfs_map_stripe *stripe;
+
+		stripe = &map_block->stripes[i];
+		if (stripe->logical > map_block->start)
+			map_block->start = stripe->logical;
+		if (stripe->logical + stripe->length >
+		    map_block->start + map_block->length)
+			map_block->length = stripe->logical + stripe->length -
+					    map_block->start;
+	}
+}
+
 int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
 			 u64 length, struct btrfs_map_block **map_ret)
 {
@@ -1768,7 +1869,7 @@ int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
 		free(map_block);
 		return ret;
 	}
-	/* TODO: Remove unrelated map_stripes for READ operation */
+	remove_unrelated_stripes(map, rw, logical, length, map_block);
 
 	*map_ret = map_block;
 	return 0;
-- 
2.11.0





* [PATCH v2 08/19] btrfs-progs: csum: Introduce function to read out one data csum
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (6 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 07/19] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 09/19] btrfs-progs: scrub: Introduce structures to support fsck scrub for RAID56 Qu Wenruo
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, btrfs_read_one_data_csum(), to read out the
data csum for one sector.

This is quite useful for reading out data csums, so we don't need to
open-code the csum tree search everywhere.
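
A hedged usage sketch (assuming the default 4-byte crc32c csum; bytenr
and fs_info come from the caller):

	u32 expected_csum = 0;
	int ret;

	/* bytenr is the logical start of one sector */
	ret = btrfs_read_one_data_csum(fs_info, bytenr, &expected_csum);
	if (ret < 0)
		return ret;	/* csum tree lookup failed */
	if (ret > 0)
		/* no csum stored for this sector (e.g. nodatasum) */;
	else
		/* expected_csum now holds the on-disk csum to compare with */;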

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 Makefile.in |  2 +-
 csum.c      | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ctree.h     |  3 ++
 3 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 csum.c

diff --git a/Makefile.in b/Makefile.in
index c3f0eeda..bb619bfe 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -99,7 +99,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
-	  kernel-lib/tables.o kernel-lib/raid56.o
+	  kernel-lib/tables.o kernel-lib/raid56.o csum.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/csum.c b/csum.c
new file mode 100644
index 00000000..53195eaf
--- /dev/null
+++ b/csum.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "utils.h"
+/*
+ * TODO:
+ * 1) Add write support for csum
+ *    So we can write new data extents and add csum into csum tree
+ * 2) Add csum range search function
+ *    So we don't need to search csum tree in a per-sectorsize loop.
+ */
+
+int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
+			     void *csum_ret)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_root *csum_root = fs_info->csum_root;
+	u32 item_offset;
+	u32 item_size;
+	u32 final_offset;
+	u32 sectorsize = fs_info->tree_root->sectorsize;
+	u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	int ret;
+
+	if (!csum_ret) {
+		error("wrong parameter for %s", __func__);
+		return -EINVAL;
+	}
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+	key.type = BTRFS_EXTENT_CSUM_KEY;
+	key.offset = bytenr;
+
+	ret = btrfs_search_slot(NULL, csum_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret == 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (!IS_ALIGNED(key.offset, sectorsize)) {
+			error("csum item bytenr %llu is not aligned to %u",
+			      key.offset, sectorsize);
+			ret = -EIO;
+			goto out;
+		}
+		u32 offset = btrfs_item_ptr_offset(path->nodes[0],
+						      path->slots[0]);
+
+		read_extent_buffer(path->nodes[0], csum_ret, offset, csum_size);
+		goto out;
+	}
+	ret = btrfs_previous_item(csum_root, path, BTRFS_EXTENT_CSUM_OBJECTID,
+				  BTRFS_EXTENT_CSUM_KEY);
+	if (ret)
+		goto out;
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	if (!IS_ALIGNED(key.offset, sectorsize)) {
+		error("csum item bytenr %llu is not aligned to %u",
+		      key.offset, sectorsize);
+		ret = -EIO;
+		goto out;
+	}
+	item_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+	item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
+	if (key.offset + item_size / csum_size * sectorsize <= bytenr) {
+		ret = 1;
+		goto out;
+	}
+
+	final_offset = (bytenr - key.offset) / sectorsize * csum_size +
+		       item_offset;
+	read_extent_buffer(path->nodes[0], csum_ret, final_offset, csum_size);
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	return ret;
+};
diff --git a/ctree.h b/ctree.h
index dd02ef86..506b107e 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2798,4 +2798,7 @@ int btrfs_get_extent(struct btrfs_trans_handle *trans,
 int btrfs_punch_hole(struct btrfs_trans_handle *trans,
 		     struct btrfs_root *root,
 		     u64 ino, u64 offset, u64 len);
+/* csum.c */
+int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
+			     void *csum_ret);
 #endif
-- 
2.11.0





* [PATCH v2 09/19] btrfs-progs: scrub: Introduce structures to support fsck scrub for RAID56
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (7 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 08/19] btrfs-progs: csum: Introduce function to read out one data csum Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 10/19] btrfs-progs: scrub: Introduce function to scrub mirror based tree block Qu Wenruo
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce new local structures, scrub_full_stripe and scrub_stripe, for
the incoming offline RAID56 scrub support.

Pure stripe/mirror based profiles, like raid0/1/10/dup/single, will keep
following the original bytenr and mirror number based iteration, so no
extra structures are needed for them.
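
As a usage sketch (values are illustrative only), a full stripe of a
3-device RAID5 block group would be tracked like this:

	/* 2 data stripes + P, each with a stripe_len (e.g. 64K) buffer */
	struct scrub_full_stripe *fstripe;

	fstripe = alloc_full_stripe(3, 64 * 1024);
	if (!fstripe)
		return -ENOMEM;

	/*
	 * Fill fstripe->logical_start/logical_len/bg_type, read each
	 * stripes[i].data from disk, then check and recover...
	 */

	free_full_stripe(fstripe);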

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 Makefile.in |   2 +-
 scrub.c     | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile.in b/Makefile.in
index bb619bfe..41da7ab5 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -99,7 +99,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
-	  kernel-lib/tables.o kernel-lib/raid56.o csum.o
+	  kernel-lib/tables.o kernel-lib/raid56.o csum.o scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index 00000000..c9ca817e
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,120 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include <unistd.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profiles (RAID56).
+ * Mirror/stripe based profiles won't need this; they are iterated by
+ * bytenr and mirror number.
+ */
+struct scrub_stripe {
+	/* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+	u64 logical;
+
+	/* Device is missing */
+	unsigned int dev_missing:1;
+
+	/* Any tree/data csum mismatches */
+	unsigned int csum_mismatch:1;
+
+	/* Some data doesn't have csum(nodatasum) */
+	unsigned int csum_missing:1;
+
+	char *data;
+};
+
+/*
+ * RAID56 full stripe(data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+	u64 logical_start;
+	u64 logical_len;
+	u64 bg_type;
+	u32 nr_stripes;
+	u32 stripe_len;
+
+	/* Read error stripes */
+	u32 err_read_stripes;
+
+	/* Missing devices */
+	u32 err_missing_devs;
+
+	/* Csum error data stripes */
+	u32 err_csum_dstripes;
+
+	/* Missing csum data stripes */
+	u32 missing_csum_dstripes;
+
+	/* corrupted stripe index */
+	int corrupted_index[2];
+
+	int nr_corrupted_stripes;
+
+	/* Already recovered once? */
+	unsigned int recovered:1;
+
+	struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+	int i;
+
+	for (i = 0; i < fstripe->nr_stripes; i++)
+		free(fstripe->stripes[i].data);
+	free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+						    u32 stripe_len)
+{
+	struct scrub_full_stripe *ret;
+	int size = sizeof(*ret) + nr_stripes * sizeof(struct scrub_stripe);
+	int i;
+
+	ret = malloc(size);
+	if (!ret)
+		return NULL;
+
+	memset(ret, 0, size);
+	ret->nr_stripes = nr_stripes;
+	ret->stripe_len = stripe_len;
+	ret->corrupted_index[0] = -1;
+	ret->corrupted_index[1] = -1;
+
+	/* Alloc data memory for each stripe */
+	for (i = 0; i < nr_stripes; i++) {
+		struct scrub_stripe *stripe = &ret->stripes[i];
+
+		stripe->data = malloc(stripe_len);
+		if (!stripe->data) {
+			free_full_stripe(ret);
+			return NULL;
+		}
+	}
+	return ret;
+}
-- 
2.11.0





* [PATCH v2 10/19] btrfs-progs: scrub: Introduce function to scrub mirror based tree block
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (8 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 09/19] btrfs-progs: scrub: Introduce structures to support fsck scrub for RAID56 Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 11/19] btrfs-progs: scrub: Introduce function to scrub mirror based data blocks Qu Wenruo
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_tree_mirror(), to scrub mirror based
tree blocks (Single/DUP/RAID0/1/10).

This function can also be used on in-memory tree blocks via the @data
parameter.
This is very handy for the RAID5/6 case: either check the tree block in
a data stripe by passing @bytenr and 0 as @mirror, or pass recovered
in-memory data through @data.
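
A short usage sketch of the two modes described above (fs_info,
scrub_ctx, bytenr, ret and recovered_data are assumed to exist in the
caller):

	/* Mirror based profiles: read mirror 1 from disk and check it */
	ret = scrub_tree_mirror(fs_info, scrub_ctx, NULL, bytenr, 1);

	/* RAID5/6 path: check an already recovered in-memory tree block */
	ret = scrub_tree_mirror(fs_info, scrub_ctx, recovered_data, bytenr, 0);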

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 disk-io.c |  4 ++--
 disk-io.h |  2 ++
 scrub.c   | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 9140a81b..d5011572 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -51,8 +51,8 @@ static u32 max_nritems(u8 level, u32 nodesize)
 		sizeof(struct btrfs_key_ptr));
 }
 
-static int check_tree_block(struct btrfs_fs_info *fs_info,
-			    struct extent_buffer *buf)
+int check_tree_block(struct btrfs_fs_info *fs_info,
+		     struct extent_buffer *buf)
 {
 
 	struct btrfs_fs_devices *fs_devices;
diff --git a/disk-io.h b/disk-io.h
index 4de9fef7..db883d57 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -119,6 +119,8 @@ static inline struct extent_buffer* read_tree_block(
 			parent_transid);
 }
 
+int check_tree_block(struct btrfs_fs_info *fs_info,
+		     struct extent_buffer *buf);
 int read_extent_data(struct btrfs_root *root, char *data, u64 logical,
 		     u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
diff --git a/scrub.c b/scrub.c
index c9ca817e..4cf678fb 100644
--- a/scrub.c
+++ b/scrub.c
@@ -118,3 +118,75 @@ static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
 	}
 	return ret;
 }
+
+static inline int is_data_stripe(struct scrub_stripe *stripe)
+{
+	u64 bytenr = stripe->logical;
+
+	if (bytenr == BTRFS_RAID5_P_STRIPE || bytenr == BTRFS_RAID6_Q_STRIPE)
+		return 0;
+	return 1;
+}
+
+/*
+ * Scrub one tree mirror given by @bytenr and @mirror, or @data.
+ * If @data is not given (NULL), the function will try to read out the tree
+ * block using @bytenr and @mirror.
+ * If @data is given, use it directly and don't read from disk.
+ *
+ * The extra @data parameter is handy for the RAID5/6 recovery code to verify
+ * the recovered data.
+ *
+ * Return 0 if everything is OK.
+ * Return <0 if something goes wrong, and @scrub_ctx accounting will be updated
+ * if it's a data corruption.
+ */
+static int scrub_tree_mirror(struct btrfs_fs_info *fs_info,
+			     struct btrfs_scrub_progress *scrub_ctx,
+			     char *data, u64 bytenr, int mirror)
+{
+	struct extent_buffer *eb;
+	u32 nodesize = fs_info->tree_root->nodesize;
+	int ret;
+
+	if (!IS_ALIGNED(bytenr, fs_info->tree_root->sectorsize)) {
+		/* Such error will be reported by check_tree_block() */
+		scrub_ctx->verify_errors++;
+		return -EIO;
+	}
+
+	eb = btrfs_find_create_tree_block(fs_info, bytenr, nodesize);
+	if (!eb)
+		return -ENOMEM;
+	if (data) {
+		memcpy(eb->data, data, nodesize);
+	} else {
+		ret = read_whole_eb(fs_info, eb, mirror);
+		if (ret) {
+			scrub_ctx->read_errors++;
+			error("failed to read tree block %llu mirror %d",
+			      bytenr, mirror);
+			goto out;
+		}
+	}
+
+	scrub_ctx->tree_bytes_scrubbed += nodesize;
+	if (csum_tree_block(fs_info->tree_root, eb, 1)) {
+		error("tree block %llu mirror %d checksum mismatch", bytenr,
+			mirror);
+		scrub_ctx->csum_errors++;
+		ret = -EIO;
+		goto out;
+	}
+	ret = check_tree_block(fs_info, eb);
+	if (ret < 0) {
+		error("tree block %llu mirror %d is invalid", bytenr, mirror);
+		scrub_ctx->verify_errors++;
+		goto out;
+	}
+
+	scrub_ctx->tree_extents_scrubbed++;
+out:
+	free_extent_buffer(eb);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 11/19] btrfs-progs: scrub: Introduce function to scrub mirror based data blocks
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (9 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 10/19] btrfs-progs: scrub: Introduce function to scrub mirror based tree block Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 12/19] btrfs-progs: scrub: Introduce function to scrub one extent Qu Wenruo
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_data_mirror(), to check mirror-based
data blocks.

It can also accept a @data parameter to use in-memory data instead of
reading it from disk.
This is a handy feature for the RAID5/6 recovery verification code.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/scrub.c b/scrub.c
index 4cf678fb..2563f407 100644
--- a/scrub.c
+++ b/scrub.c
@@ -190,3 +190,85 @@ out:
 	free_extent_buffer(eb);
 	return ret;
 }
+
+/*
+ * Scrub one data mirror given by @start, @len and @mirror, or @data.
+ * If @data is not given, try to read it from disk.
+ * This function will try to read out all the data and then check the csum.
+ *
+ * If @data is given, just use the data.
+ * This behavior is useful for RAID5/6 recovery code to verify recovered data.
+ *
+ * Return 0 if everything is OK.
+ * Return <0 if something goes wrong, and @scrub_ctx accounting will be updated
+ * if it's a data corruption.
+ */
+static int scrub_data_mirror(struct btrfs_fs_info *fs_info,
+			     struct btrfs_scrub_progress *scrub_ctx,
+			     char *data, u64 start, u64 len, int mirror)
+{
+	u64 cur = 0;
+	u32 csum;
+	u32 sectorsize = fs_info->tree_root->sectorsize;
+	char *buf = NULL;
+	int ret = 0;
+	int err = 0;
+
+	if (!data) {
+		buf = malloc(len);
+		if (!buf)
+			return -ENOMEM;
+		/* Read out as much data as possible to speed up read */
+		while (cur < len) {
+			u64 read_len = len - cur;
+
+			ret = read_extent_data(fs_info->tree_root, buf + cur,
+					start + cur, &read_len, mirror);
+			if (ret < 0) {
+				error("failed to read out data at logical bytenr %llu mirror %d",
+				      start + cur, mirror);
+				scrub_ctx->read_errors++;
+				goto out;
+			}
+			scrub_ctx->data_bytes_scrubbed += read_len;
+			cur += read_len;
+		}
+	} else {
+		buf = data;
+	}
+
+	/* Check csum per-sectorsize */
+	cur = 0;
+	while (cur < len) {
+		u32 data_csum = ~(u32)0;
+
+		ret = btrfs_read_one_data_csum(fs_info, start + cur, &csum);
+		if (ret > 0) {
+			scrub_ctx->csum_discards++;
+			ret = 0;
+
+			/* In case only some csum are missing */
+			goto next;
+		}
+		data_csum = btrfs_csum_data(NULL, buf + cur, data_csum,
+					    sectorsize);
+		btrfs_csum_final(data_csum, (u8 *)&data_csum);
+		if (data_csum != csum) {
+			error("data at bytenr %llu mirror %d csum mismatch, have %u expect %u",
+			      start + cur, mirror, data_csum, csum);
+			err = 1;
+			scrub_ctx->csum_errors++;
+			cur += sectorsize;
+			continue;
+		}
+		scrub_ctx->data_bytes_scrubbed += sectorsize;
+next:
+		cur += sectorsize;
+	}
+out:
+	if (!data)
+		free(buf);
+	if (!ret && err)
+		return -EIO;
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 12/19] btrfs-progs: scrub: Introduce function to scrub one extent
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (10 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 11/19] btrfs-progs: scrub: Introduce function to scrub mirror based data blocks Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 13/19] btrfs-progs: scrub: Introduce function to scrub one data stripe Qu Wenruo
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_one_extent(), as a wrapper to check one
extent.

It accepts a btrfs_path parameter @path, which must point to a
META/EXTENT_ITEM.
And @start, @len, which must be a subset of that META/EXTENT_ITEM.

The @report parameter determines whether we output errors.
Since the function will be reused by the RAID56 code, we want it to be
able to run silently.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/scrub.c b/scrub.c
index 2563f407..f4ed0b78 100644
--- a/scrub.c
+++ b/scrub.c
@@ -272,3 +272,82 @@ out:
 		return -EIO;
 	return ret;
 }
+
+/*
+ * Check all copies of range @start, @len.
+ * Caller must ensure the range is covered by EXTENT_ITEM/METADATA_ITEM
+ * specified by leaf of @path.
+ * And @start, @len must be a subset of the EXTENT_ITEM/METADATA_ITEM.
+ *
+ * If @report is set and the range has a corrupted mirror, report whether the
+ * range is still recoverable or totally corrupted.
+ * Clearing it allows silent verification for the RAID5/6 recovery code.
+ *
+ * Return 0 if the range is all OK or recoverable.
+ * Return <0 if the range can't be recovered.
+ */
+static int scrub_one_extent(struct btrfs_fs_info *fs_info,
+			    struct btrfs_scrub_progress *scrub_ctx,
+			    struct btrfs_path *path, u64 start, u64 len,
+			    int report)
+{
+	struct btrfs_key key;
+	struct btrfs_extent_item *ei;
+	struct extent_buffer *leaf = path->nodes[0];
+	int slot = path->slots[0];
+	int num_copies;
+	int corrupted = 0;
+	u64 extent_start;
+	u64 extent_len;
+	int metadata = 0;
+	int i;
+	int ret;
+
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	if (key.type != BTRFS_METADATA_ITEM_KEY &&
+	    key.type != BTRFS_EXTENT_ITEM_KEY)
+		goto invalid_arg;
+
+	extent_start = key.objectid;
+	if (key.type == BTRFS_METADATA_ITEM_KEY) {
+		extent_len = fs_info->tree_root->nodesize;
+		metadata = 1;
+	} else {
+		extent_len = key.offset;
+		ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+		if (btrfs_extent_flags(leaf, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+			metadata = 1;
+	}
+	if (start >= extent_start + extent_len ||
+	    start + len <= extent_start)
+		goto invalid_arg;
+	num_copies = btrfs_num_copies(&fs_info->mapping_tree, start, len);
+	for (i = 1; i <= num_copies; i++) {
+		if (metadata) {
+			ret = scrub_tree_mirror(fs_info, scrub_ctx,
+					NULL, extent_start, i);
+			scrub_ctx->tree_extents_scrubbed++;
+		} else {
+			ret = scrub_data_mirror(fs_info, scrub_ctx, NULL,
+						start, len, i);
+			scrub_ctx->data_extents_scrubbed++;
+		}
+		if (ret < 0)
+			corrupted++;
+	}
+
+	if (report) {
+		if (corrupted && corrupted < num_copies)
+			printf("bytenr %llu len %llu has a corrupted mirror, but is recoverable\n",
+				start, len);
+		else if (corrupted >= num_copies)
+			error("bytenr %llu len %llu has a corrupted mirror, can't be recovered",
+				start, len);
+	}
+	if (corrupted < num_copies)
+		return 0;
+	return -EIO;
+invalid_arg:
+	error("invalid parameter for %s", __func__);
+	return -EINVAL;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 13/19] btrfs-progs: scrub: Introduce function to scrub one data stripe
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (11 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 12/19] btrfs-progs: scrub: Introduce function to scrub one extent Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 14/19] btrfs-progs: scrub: Introduce function to verify parities Qu Wenruo
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_one_data_stripe(), to check all data and
tree blocks inside one data stripe.

This function will not try to recover any error, but only checks whether
any data/tree block has a csum mismatch.

If data is missing its csum, which is completely valid for cases like
nodatasum, it will just be recorded, but not reported as an error.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/scrub.c b/scrub.c
index f4ed0b78..15a1955c 100644
--- a/scrub.c
+++ b/scrub.c
@@ -351,3 +351,132 @@ invalid_arg:
 	error("invalid parameter for %s", __func__);
 	return -EINVAL;
 }
+
+/*
+ * Scrub one full data stripe of RAID5/6.
+ * This means it will check any data/metadata extent in the data stripe
+ * specified by @stripe and @stripe_len.
+ *
+ * This function will only *CHECK* if the data stripe has any corruption.
+ * It won't do any repair.
+ *
+ * Return 0 if the full stripe is OK.
+ * Return <0 if any error is found.
+ * Note: Missing csum is not counted as an error (NODATACSUM is valid)
+ */
+static int scrub_one_data_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 struct scrub_stripe *stripe, u32 stripe_len)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_len;
+	u64 orig_csum_discards;
+	int ret;
+
+	if (!is_data_stripe(stripe))
+		return -EINVAL;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = stripe->logical + stripe_len;
+	key.offset = 0;
+	key.type = 0;
+
+	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	while (1) {
+		struct btrfs_extent_item *ei;
+		struct extent_buffer *eb;
+		char *data;
+		int slot;
+		int metadata = 0;
+		u64 check_start;
+		u64 check_len;
+
+		ret = btrfs_previous_extent_item(extent_root, path, 0);
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+		if (ret < 0)
+			goto out;
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &key, slot);
+		extent_start = key.objectid;
+		ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
+
+		/* tree block scrub */
+		if (key.type == BTRFS_METADATA_ITEM_KEY ||
+		    btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+			extent_len = extent_root->nodesize;
+			metadata = 1;
+		} else {
+			extent_len = key.offset;
+			metadata = 0;
+		}
+
+		/* Current extent is out of our range, loop comes to end */
+		if (extent_start + extent_len <= stripe->logical)
+			break;
+
+		if (metadata) {
+			/*
+			 * Check crossing stripe first, which can't be scrubbed
+			 */
+			if (check_crossing_stripes(fs_info, extent_start,
+					extent_root->nodesize)) {
+				error("tree block at %llu is crossing stripe boundary, unable to scrub",
+					extent_start);
+				ret = -EIO;
+				goto out;
+			}
+			data = stripe->data + extent_start - stripe->logical;
+			ret = scrub_tree_mirror(fs_info, scrub_ctx,
+						data, extent_start, 0);
+			/* Any csum/verify error means the stripe is screwed */
+			if (ret < 0) {
+				stripe->csum_mismatch = 1;
+				ret = -EIO;
+				goto out;
+			}
+			ret = 0;
+			continue;
+		}
+		/* Restrict the extent range to fit stripe range */
+		check_start = max(extent_start, stripe->logical);
+		check_len = min(extent_start + extent_len, stripe->logical +
+				stripe_len) - check_start;
+
+		/* Record original csum_discards to detect missing csum case */
+		orig_csum_discards = scrub_ctx->csum_discards;
+
+		data = stripe->data + check_start - stripe->logical;
+		ret = scrub_data_mirror(fs_info, scrub_ctx, data, check_start,
+					check_len, 0);
+		/* Csum mismatch, no need to continue anyway */
+		if (ret < 0) {
+			stripe->csum_mismatch = 1;
+			goto out;
+		}
+		/* Check if there is any missing csum for data */
+		if (scrub_ctx->csum_discards != orig_csum_discards)
+			stripe->csum_missing = 1;
+		/*
+		 * Only increase data_extents_scrubbed if we are scrubbing the
+		 * tail part of the data extent
+		 */
+		if (extent_start + extent_len <= stripe->logical + stripe_len)
+			scrub_ctx->data_extents_scrubbed++;
+		ret = 0;
+	}
+out:
+	btrfs_free_path(path);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 14/19] btrfs-progs: scrub: Introduce function to verify parities
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (12 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 13/19] btrfs-progs: scrub: Introduce function to scrub one data stripe Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 15/19] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range Qu Wenruo
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, verify_parities(), to check whether the parities
match for a full stripe whose data stripes all match their csums.

The caller should fill the scrub_full_stripe structure properly before
calling this function.
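
As a rough illustration of the RAID5 half of the check (illustrative code
only, not the helpers used in this patch; the RAID6 Q stripe is handled
through raid6_gen_syndrome() in the code below): generate the byte-wise XOR
of all data stripes and compare it with the on-disk P stripe.

#include <stddef.h>

/* Illustration only: return 1 if the XOR parity of nr_data data stripes
 * (stripe_len bytes each) matches the on-disk P stripe, 0 otherwise. */
static int raid5_parity_matches(char **data, int nr_data,
				const char *ondisk_p, size_t stripe_len)
{
	size_t i;
	int j;

	for (i = 0; i < stripe_len; i++) {
		char p = 0;

		for (j = 0; j < nr_data; j++)
			p ^= data[j][i];	/* P is the byte-wise XOR */
		if (p != ondisk_p[i])
			return 0;
	}
	return 1;
}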

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/scrub.c b/scrub.c
index 15a1955c..238feb3c 100644
--- a/scrub.c
+++ b/scrub.c
@@ -25,6 +25,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "utils.h"
+#include "kernel-lib/raid56.h"
 
 /*
  * For parity based profile(RAID56)
@@ -480,3 +481,71 @@ out:
 	btrfs_free_path(path);
 	return ret;
 }
+
+/*
+ * Verify parities for RAID56
+ * Caller must fill @fstripe before calling this function
+ *
+ * Return 0 if the parities match.
+ * Return >0 for P or Q mismatch
+ * Return <0 for fatal error
+ */
+static int verify_parities(struct btrfs_fs_info *fs_info,
+			   struct btrfs_scrub_progress *scrub_ctx,
+			   struct scrub_full_stripe *fstripe)
+{
+	void **ptrs;
+	void *ondisk_p = NULL;
+	void *ondisk_q = NULL;
+	void *buf_p;
+	void *buf_q;
+	int nr_stripes = fstripe->nr_stripes;
+	int stripe_len = BTRFS_STRIPE_LEN;
+	int i;
+	int ret = 0;
+
+	ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+	buf_p = malloc(fstripe->stripe_len);
+	buf_q = malloc(fstripe->stripe_len);
+	if (!ptrs || !buf_p || !buf_q) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < fstripe->nr_stripes; i++) {
+		struct scrub_stripe *stripe = &fstripe->stripes[i];
+
+		if (stripe->logical == BTRFS_RAID5_P_STRIPE) {
+			ondisk_p = stripe->data;
+			ptrs[i] = buf_p;
+			continue;
+		} else if (stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+			ondisk_q = stripe->data;
+			ptrs[i] = buf_q;
+			continue;
+		} else {
+			ptrs[i] = stripe->data;
+			continue;
+		}
+	}
+	/* RAID6 */
+	if (ondisk_q) {
+		raid6_gen_syndrome(nr_stripes, stripe_len, ptrs);
+
+		if (memcmp(ondisk_q, ptrs[nr_stripes - 1], stripe_len) != 0 ||
+		    memcmp(ondisk_p, ptrs[nr_stripes - 2], stripe_len))
+			ret = 1;
+	} else {
+		ret = raid5_gen_result(nr_stripes, stripe_len, nr_stripes - 1,
+					ptrs);
+		if (ret < 0)
+			goto out;
+		if (memcmp(ondisk_p, ptrs[nr_stripes - 1], stripe_len) != 0)
+			ret = 1;
+	}
+out:
+	free(buf_p);
+	free(buf_q);
+	free(ptrs);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 15/19] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range.
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (13 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 14/19] btrfs-progs: scrub: Introduce function to verify parities Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 16/19] btrfs-progs: scrub: Introduce function to recover data parity Qu Wenruo
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, btrfs_check_extent_exists(), to check if there
is any extent in the range specified by the caller.

The range can be large: if any extent exists in the range, the function
returns >0 (in fact it returns 1), and it returns 0 if no extent is
found.
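
A sketch of the intended call pattern (illustrative only; the RAID5/6 full
stripe scrub introduced in a later patch uses it exactly this way to skip
full stripes that contain no extents):

/* Illustration only: decide whether a full stripe needs scrubbing at all */
static int full_stripe_needs_scrub(struct btrfs_fs_info *fs_info,
				   u64 start, u64 len)
{
	int ret = btrfs_check_extent_exists(fs_info, start, len);

	if (ret < 0)
		return ret;	/* fatal error, let the caller handle it */
	return ret > 0;		/* 1: something to scrub, 0: skip the stripe */
}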

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 ctree.h       |  2 ++
 extent-tree.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/ctree.h b/ctree.h
index 506b107e..fe7c077e 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2561,6 +2561,8 @@ int exclude_super_stripes(struct btrfs_root *root,
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 		       struct btrfs_fs_info *info, u64 start, u64 end);
 u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset);
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+			      u64 len);
 
 /* ctree.c */
 int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
diff --git a/extent-tree.c b/extent-tree.c
index b2847ff9..92868395 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -4256,3 +4256,63 @@ u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 
 	return total_added;
 }
+
+/*
+ * Check if there is any extent(both data and metadata) in the range
+ * [@start, @start + @len)
+ *
+ * Return 0 for no extent found.
+ * Return >0 for found extent.
+ * Return <0 for fatal error.
+ */
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+			      u64 len)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_len;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = start + len;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, fs_info->extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	/*
+	 * Now we're pointing at the slot whose key.objectid >= end, skip to the
+	 * previous extent.
+	 */
+	ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	extent_start = key.objectid;
+	if (key.type == BTRFS_METADATA_ITEM_KEY)
+		extent_len = fs_info->extent_root->nodesize;
+	else
+		extent_len = key.offset;
+
+	/*
+	 * search_slot() and previous_extent_item() have ensured that our
+	 * extent_start < start + len, so we only need to check the extent end.
+	 */
+	if (extent_start + extent_len <= start)
+		ret = 0;
+	else
+		ret = 1;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 16/19] btrfs-progs: scrub: Introduce function to recover data parity
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (14 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 15/19] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 17/19] btrfs-progs: scrub: Introduce a function to scrub one full stripe Qu Wenruo
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, recover_from_parities(), to recover data stripes.

It just wraps raid56_recov() with extra checks against the
scrub_full_stripe structure.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/scrub.c b/scrub.c
index 238feb3c..522802c2 100644
--- a/scrub.c
+++ b/scrub.c
@@ -549,3 +549,54 @@ out:
 	free(ptrs);
 	return ret;
 }
+
+/*
+ * Try to recover data stripes from the P or Q stripe
+ *
+ * Return >0 if it can't be recovered any more.
+ * Return 0 for successful repair or no need to repair at all
+ * Return <0 for fatal error
+ */
+static int recover_from_parities(struct btrfs_fs_info *fs_info,
+				  struct btrfs_scrub_progress *scrub_ctx,
+				  struct scrub_full_stripe *fstripe)
+{
+	void **ptrs;
+	int nr_stripes = fstripe->nr_stripes;
+	int stripe_len = BTRFS_STRIPE_LEN;
+	int max_tolerance;
+	int i;
+	int ret;
+
+	/* No need to recover */
+	if (!fstripe->nr_corrupted_stripes)
+		return 0;
+
+	/* Already recovered once, no more chance */
+	if (fstripe->recovered)
+		return 1;
+
+	if (fstripe->bg_type & BTRFS_BLOCK_GROUP_RAID5)
+		max_tolerance = 1;
+	else
+		max_tolerance = 2;
+
+	/* Out of repair */
+	if (fstripe->nr_corrupted_stripes > max_tolerance)
+		return 1;
+
+	ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+	if (!ptrs)
+		return -ENOMEM;
+
+	/* Construct ptrs */
+	for (i = 0; i < nr_stripes; i++)
+		ptrs[i] = fstripe->stripes[i].data;
+
+	ret = raid56_recov(nr_stripes, stripe_len, fstripe->bg_type,
+			fstripe->corrupted_index[0],
+			fstripe->corrupted_index[1], ptrs);
+	fstripe->recovered = 1;
+	free(ptrs);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 17/19] btrfs-progs: scrub: Introduce a function to scrub one full stripe
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (15 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 16/19] btrfs-progs: scrub: Introduce function to recover data parity Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 18/19] btrfs-progs: scrub: Introduce function to check a whole block group Qu Wenruo
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_one_full_stripe(), to check a full
stripe.

It handles the full stripe scrub in the following steps:
0) Check whether we need to scrub the full stripe
   If the full stripe contains no extent, why waste our CPU and IO?

1) Read out the full stripe
   Then we know how many devices are missing or have read errors.
   If that is already beyond repair, exit.

   If there is a missing device or a read error, try to recover here.

2) Check the data stripes against their csums
   A data stripe with a csum error is added as a corrupted stripe, just
   like a missing device or a read error.
   Then recheck whether the number of corrupted stripes is still below
   the tolerance.

Finally we check the full stripe using only 2 factors:
A) Whether the full stripe ever went through recovery
B) Whether the full stripe has a csum error

Combining factors A and B we get (see the sketch after this list):
1) A && B: Recovered, csum mismatch
   Screwed up totally
2) A && !B: Recovered, csum match
   Recoverable, data was corrupted but P/Q is good enough to recover it
3) !A && B: Not recovered, csum mismatch
   Try to recover the corrupted data stripes
   If the recovered data matches its csum, then it's recoverable
   Else, screwed up
4) !A && !B: Not recovered, no csum mismatch
   Best case, just check whether P/Q matches.
   If P/Q matches, everything is good
   Else, only P/Q is screwed up, still recoverable.
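
A minimal sketch of how the two factors combine (illustrative names only,
none of these helpers exist in this patch; parity_ok stands for "generated
P/Q matches the on-disk P/Q" and recheck_ok for "csums match after
recovering the corrupted data stripes"):

/* Illustration only: map factor A (recovered) and B (csum_err) to verdicts */
enum fs_verdict { FS_GOOD, FS_RECOVERABLE, FS_CORRUPTED };

static enum fs_verdict full_stripe_verdict(int recovered, int csum_err,
					   int parity_ok, int recheck_ok)
{
	if (recovered && csum_err)	/* case 1 */
		return FS_CORRUPTED;
	if (recovered && !csum_err)	/* case 2 */
		return FS_RECOVERABLE;
	if (!recovered && csum_err)	/* case 3 */
		return recheck_ok ? FS_RECOVERABLE : FS_CORRUPTED;
	/* case 4: not recovered, no csum mismatch */
	return parity_ok ? FS_GOOD : FS_RECOVERABLE;
}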

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 262 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 262 insertions(+)

diff --git a/scrub.c b/scrub.c
index 522802c2..bb94fa9f 100644
--- a/scrub.c
+++ b/scrub.c
@@ -600,3 +600,265 @@ static int recover_from_parities(struct btrfs_fs_info *fs_info,
 	free(ptrs);
 	return ret;
 }
+
+/*
+ * Return 0 if we still have a chance to recover
+ * Return <0 if we have no chance left
+ */
+static int report_recoverablity(struct scrub_full_stripe *fstripe)
+{
+	int max_tolerance;
+	u64 start = fstripe->logical_start;
+
+	if (fstripe->bg_type & BTRFS_BLOCK_GROUP_RAID5)
+		max_tolerance = 1;
+	else
+		max_tolerance = 2;
+
+	if (fstripe->nr_corrupted_stripes > max_tolerance) {
+		error(
+	"full stripe %llu CORRUPTED: too many read errors or corrupted devices",
+			start);
+		error(
+	"full stripe %llu: tolerance: %d, missing: %d, read error: %d, csum error: %d",
+			start, max_tolerance, fstripe->err_missing_devs,
+			fstripe->err_read_stripes, fstripe->err_csum_dstripes);
+		return -EIO;
+	}
+	return 0;
+}
+
+static void clear_corrupted_stripe_record(struct scrub_full_stripe *fstripe)
+{
+	fstripe->corrupted_index[0] = -1;
+	fstripe->corrupted_index[1] = -1;
+	fstripe->nr_corrupted_stripes = 0;
+}
+
+static void record_corrupted_stripe(struct scrub_full_stripe *fstripe,
+				    int index)
+{
+	int i = 0;
+
+	for (i = 0; i < 2; i++) {
+		if (fstripe->corrupted_index[i] == -1) {
+			fstripe->corrupted_index[i] = index;
+			break;
+		}
+	}
+	fstripe->nr_corrupted_stripes++;
+}
+
+/*
+ * Scrub one full stripe.
+ *
+ * If everything matches, that's good.
+ * If a data stripe is corrupted beyond repair, report it.
+ * If a data stripe is corrupted, try recovery first and recheck the csum, to
+ * determine whether it's recoverable or screwed up.
+ */
+static int scrub_one_full_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 u64 start, u64 *next_ret)
+{
+	struct scrub_full_stripe *fstripe;
+	struct btrfs_map_block *map_block = NULL;
+	u32 stripe_len = BTRFS_STRIPE_LEN;
+	u64 bg_type;
+	u64 len;
+	int i;
+	int ret;
+
+	if (!next_ret) {
+		error("invalid argument for %s", __func__);
+		return -EINVAL;
+	}
+
+	ret = __btrfs_map_block_v2(fs_info, WRITE, start, stripe_len,
+				   &map_block);
+	if (ret < 0) {
+		/* Let the caller skip the whole block group */
+		*next_ret = (u64)-1;
+		return ret;
+	}
+	start = map_block->start;
+	len = map_block->length;
+	*next_ret = start + len;
+
+	/*
+	 * Step 0: Check if we need to scrub the full stripe
+	 *
+	 * If no extent lies in the full stripe, no need to check it
+	 */
+	ret = btrfs_check_extent_exists(fs_info, start, len);
+	if (ret < 0) {
+		free(map_block);
+		return ret;
+	}
+	/* No extents in range, no need to check */
+	if (ret == 0) {
+		free(map_block);
+		return 0;
+	}
+
+	bg_type = map_block->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	if (bg_type != BTRFS_BLOCK_GROUP_RAID5 && 
+	    bg_type != BTRFS_BLOCK_GROUP_RAID6) {
+		free(map_block);
+		return -EINVAL;
+	}
+
+	fstripe = alloc_full_stripe(map_block->num_stripes,
+				    map_block->stripe_len);
+	if (!fstripe) {
+		free(map_block);
+		return -ENOMEM;
+	}
+
+	fstripe->logical_start = map_block->start;
+	fstripe->nr_stripes = map_block->num_stripes;
+	fstripe->stripe_len = stripe_len;
+	fstripe->bg_type = bg_type;
+
+	/*
+	 * Step 1: Read out the whole full stripe
+	 *
+	 * Then we have the chance to exit early if too many devices are
+	 * missing.
+	 */
+	for (i = 0; i < map_block->num_stripes; i++) {
+		struct scrub_stripe *s_stripe = &fstripe->stripes[i];
+		struct btrfs_map_stripe *m_stripe = &map_block->stripes[i];
+
+		s_stripe->logical = m_stripe->logical;
+
+		if (m_stripe->dev->fd == -1) {
+			s_stripe->dev_missing = 1;
+			record_corrupted_stripe(fstripe, i);
+			fstripe->err_missing_devs++;
+			continue;
+		}
+
+		ret = pread(m_stripe->dev->fd, s_stripe->data, stripe_len,
+			    m_stripe->physical);
+		if (ret < stripe_len) {
+			record_corrupted_stripe(fstripe, i);
+			fstripe->err_read_stripes++;
+			continue;
+		}
+	}
+
+	ret = report_recoverablity(fstripe);
+	if (ret < 0)
+		goto out;
+
+	ret = recover_from_parities(fs_info, scrub_ctx, fstripe);
+	if (ret < 0) {
+		error("full stripe %llu CORRUPTED: failed to recover: %s\n",
+		      fstripe->logical_start, strerror(-ret));
+		goto out;
+	}
+
+	/*
+	 * Clear corrupted stripes report, since they are recovered,
+	 * Clear the corrupted stripe record, since those stripes are recovered,
+	 * and the later checker needs to record csum mismatch stripes reusing
+	 */
+	clear_corrupted_stripe_record(fstripe);
+
+	/*
+	 * Step 2: Check each data stripes against csum
+	 */
+	for (i = 0; i < map_block->num_stripes; i++) {
+		struct scrub_stripe *stripe = &fstripe->stripes[i];
+
+		if (!is_data_stripe(stripe))
+			continue;
+		ret = scrub_one_data_stripe(fs_info, scrub_ctx, stripe,
+					    stripe_len);
+		if (ret < 0) {
+			fstripe->err_csum_dstripes++;
+			record_corrupted_stripe(fstripe, i);
+		}
+	}
+
+	ret = report_recoverablity(fstripe);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Recovered before, but no csum error
+	 */
+	if (fstripe->err_csum_dstripes == 0 && fstripe->recovered) {
+		error(
+		"full stripe %llu RECOVERABLE: P/Q is good for recovery",
+			start);
+		ret = 0;
+		goto out;
+	}
+	/*
+	 * No csum error, not recovered before.
+	 *
+	 * Only need to check if P/Q matches.
+	 */
+	if (fstripe->err_csum_dstripes == 0 && !fstripe->recovered) {
+		ret = verify_parities(fs_info, scrub_ctx, fstripe);
+		if (ret < 0)
+			error(
+		"full stripe %llu CORRUPTED: failed to check P/Q: %s",
+				start, strerror(-ret));
+		if (ret > 0) {
+			error(
+		"full stripe %llu RECOVERABLE: only P/Q is corrupted",
+				start);
+			ret = 0;
+		}
+		goto out;
+	}
+
+	/*
+	 * Still csum error after recovery
+	 *
+	 * No means to fix it further, it's already screwed up.
+	 */
+	if (fstripe->err_csum_dstripes && fstripe->recovered) {
+		error(
+	"full stripe %llu CORRUPTED: csum still mismatch after recovery",
+			start);
+		ret = -EIO;
+		goto out;
+	}
+	
+	/* Csum mismatch, but we still have a chance to recover. */
+	ret = recover_from_parities(fs_info, scrub_ctx, fstripe);
+	if (ret < 0) {
+		error(
+	"full stripe %llu CORRUPTED: failed to recover: %s\n",
+			fstripe->logical_start, strerror(-ret));
+		goto out;
+	}
+
+	/* After recovery, recheck data stripe csum */
+	for (i = 0; i < 2; i++) {
+		int index = fstripe->corrupted_index[i];
+		struct scrub_stripe *stripe;
+
+		if (index == -1)
+			continue;
+		stripe = &fstripe->stripes[index];
+		ret = scrub_one_data_stripe(fs_info, scrub_ctx, stripe,
+					    stripe_len);
+		if (ret < 0) {
+			error(
+	"full stripe %llu CORRUPTED: csum still mismatch after recovery",
+				start);
+			goto out;
+		}
+	}
+	error(
+	"full stripe %llu RECOVERABLE: Data stripes corrupted, but P/Q is good",
+		start);
+
+out:
+	free_full_stripe(fstripe);
+	free(map_block);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 18/19] btrfs-progs: scrub: Introduce function to check a whole block group
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (16 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 17/19] btrfs-progs: scrub: Introduce a function to scrub one full stripe Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  6:29 ` [PATCH v2 19/19] btrfs-progs: fsck: Introduce offline scrub function Qu Wenruo
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Introduce a new function, scrub_one_block_group(), to scrub one block group.

For SINGLE/DUP/RAID0/RAID1/RAID10, we use the old mirror-number-based
map_block() and check extent by extent.

For parity-based profiles (RAID5/6), we use the new map_block_v2() and check
full stripe by full stripe.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 scrub.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/scrub.c b/scrub.c
index bb94fa9f..8f122012 100644
--- a/scrub.c
+++ b/scrub.c
@@ -862,3 +862,94 @@ out:
 	free(map_block);
 	return ret;
 }
+
+/*
+ * Scrub one block group.
+ *
+ * This function will handle all profiles btrfs currently supports.
+ * Return 0 if the block group was scrubbed. Any errors found are recorded in
+ * scrub_ctx.
+ * Return <0 for a fatal error preventing us from scrubbing the block group.
+ */
+static int scrub_one_block_group(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 struct btrfs_block_group_cache *bg_cache)
+{
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 bg_start = bg_cache->key.objectid;
+	u64 bg_len = bg_cache->key.offset;
+	int ret;
+
+	if (bg_cache->flags &
+	    (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+		u64 cur = bg_start;
+		u64 next;
+
+		while (cur < bg_start + bg_len) {
+			ret = scrub_one_full_stripe(fs_info, scrub_ctx, cur,
+						    &next);
+			/* Ignore any non-fatal error */
+			if (ret < 0 && ret != -EIO) {
+				error("fatal error happens checking one full stripe at bytenr: %llu: %s",
+					cur, strerror(-ret));
+				return ret;
+			}
+			cur = next;
+		}
+		/* Ignore any -EIO error, such error will be reported at last */
+		return 0;
+	}
+	/* Non-parity based profile, check extent by extent */
+	key.objectid = bg_start;
+	key.type = 0;
+	key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	while (1) {
+		struct extent_buffer *eb = path->nodes[0];
+		int slot = path->slots[0];
+		u64 extent_start;
+		u64 extent_len;
+
+		btrfs_item_key_to_cpu(eb, &key, slot);
+		if (key.objectid >= bg_start + bg_len)
+			break;
+		if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+		    key.type != BTRFS_METADATA_ITEM_KEY)
+			goto next;
+
+		extent_start = key.objectid;
+		if (key.type == BTRFS_METADATA_ITEM_KEY)
+			extent_len = extent_root->nodesize;
+		else
+			extent_len = key.offset;
+
+		ret = scrub_one_extent(fs_info, scrub_ctx, path, extent_start,
+					extent_len, 1);
+		if (ret < 0 && ret != -EIO) {
+			error("fatal error checking extent bytenr %llu len %llu: %s",
+				extent_start, extent_len, strerror(-ret));
+			goto out;
+		}
+		ret = 0;
+next:
+		ret = btrfs_next_extent_item(extent_root, path, bg_start +
+					     bg_len);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+	}
+out:
+	btrfs_free_path(path);
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 19/19] btrfs-progs: fsck: Introduce offline scrub function
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (17 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 18/19] btrfs-progs: scrub: Introduce function to check a whole block group Qu Wenruo
@ 2016-12-26  6:29 ` Qu Wenruo
  2016-12-26  8:42 ` [PATCH v2 00/19] Btrfs offline scrub Qu Wenruo
  2016-12-29 18:15 ` [PATCH v2 00/19] Goffredo Baroncelli
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  6:29 UTC (permalink / raw)
  To: linux-btrfs

Now, btrfs check has a kernel scrub equivalent.
A new option, --scrub, is added to "btrfs check".

If --scrub is given, btrfs check will just act like kernel scrub: check
every copy of every extent and report corrupted data and whether it is
recoverable.

The advantages compared to kernel scrub are:
1) No race
   Unlike kernel scrub, which is done in parallel, offline scrub is done
   by a single thread.
   Although it may be slower than the kernel one, it's safer and gives
   no false alerts.

2) Correctness
   The kernel has a known bug (fix submitted) which will recover RAID5/6
   data but screw up P/Q, due to how hard this is to code in the kernel.
   In btrfs-progs there are no pages and (almost) no memory size limits,
   so we can focus on the scrub and make things easier.
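
For reference, a typical invocation on an unmounted filesystem would look
like this (the device path is only an example):

	btrfs check --scrub /dev/sdb1

If nothing is corrupted, the command just prints the summary counters at
the end and exits with 0.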

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 Documentation/btrfs-check.asciidoc |  7 ++++++
 cmds-check.c                       | 12 +++++++++-
 ctree.h                            |  3 +++
 scrub.c                            | 49 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index 633cbbf6..d421afa4 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -91,6 +91,13 @@ the entire free space cache. This option with 'v2' provides an alternative
 method of clearing the free space cache that doesn't require mounting the
 filesystem.
 
+--scrub::
+Kernel scrub equivalent.
++
+Offline scrub has a better reconstruction check than the kernel scrub and
+won't cause possible silent data corruption for RAID5.
++
+NOTE: Repair is not supported yet.
 
 DANGEROUS OPTIONS
 -----------------
diff --git a/cmds-check.c b/cmds-check.c
index 1dba2985..3a16a1ff 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12588,6 +12588,7 @@ int cmd_check(int argc, char **argv)
 	int clear_space_cache = 0;
 	int qgroup_report = 0;
 	int qgroups_repaired = 0;
+	int scrub = 0;
 	unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
 	while(1) {
@@ -12595,7 +12596,8 @@ int cmd_check(int argc, char **argv)
 		enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
 			GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
 			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
-			GETOPT_VAL_MODE, GETOPT_VAL_CLEAR_SPACE_CACHE };
+			GETOPT_VAL_MODE, GETOPT_VAL_CLEAR_SPACE_CACHE,
+			GETOPT_VAL_SCRUB };
 		static const struct option long_options[] = {
 			{ "super", required_argument, NULL, 's' },
 			{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -12617,6 +12619,7 @@ int cmd_check(int argc, char **argv)
 				GETOPT_VAL_MODE },
 			{ "clear-space-cache", required_argument, NULL,
 				GETOPT_VAL_CLEAR_SPACE_CACHE},
+			{ "scrub", no_argument, NULL, GETOPT_VAL_SCRUB },
 			{ NULL, 0, NULL, 0}
 		};
 
@@ -12701,6 +12704,9 @@ int cmd_check(int argc, char **argv)
 				}
 				ctree_flags |= OPEN_CTREE_WRITES;
 				break;
+			case GETOPT_VAL_SCRUB:
+				scrub = 1;
+				break;
 		}
 	}
 
@@ -12755,6 +12761,10 @@ int cmd_check(int argc, char **argv)
 
 	global_info = info;
 	root = info->fs_root;
+	if (scrub) {
+		ret = btrfs_scrub(info, repair);
+		goto err_out;
+	}
 	if (clear_space_cache == 1) {
 		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE)) {
 			error(
diff --git a/ctree.h b/ctree.h
index fe7c077e..8f669ee7 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2803,4 +2803,7 @@ int btrfs_punch_hole(struct btrfs_trans_handle *trans,
 /* csum.c */
 int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
 			     void *csum_ret);
+
+/* scrub.c */
+int btrfs_scrub(struct btrfs_fs_info *fs_info, int repair);
 #endif
diff --git a/scrub.c b/scrub.c
index 8f122012..00662da0 100644
--- a/scrub.c
+++ b/scrub.c
@@ -953,3 +953,52 @@ out:
 	btrfs_free_path(path);
 	return ret;
 }
+
+int btrfs_scrub(struct btrfs_fs_info *fs_info, int repair)
+{
+	struct btrfs_block_group_cache *bg_cache;
+	struct btrfs_scrub_progress scrub_ctx = {0};
+	int ret = 0;
+
+	/*
+	 * TODO: Support repair, which should not be too hard
+	 */
+	if (repair) {
+		error("Read-write scrub is not supported yet");
+		return 1;
+	}
+
+	bg_cache = btrfs_lookup_first_block_group(fs_info, 0);
+	if (!bg_cache) {
+		error("no block group is found");
+		return -ENOENT;
+	}
+
+	while (1) {
+		ret = scrub_one_block_group(fs_info, &scrub_ctx, bg_cache);
+		if (ret < 0 && ret != -EIO)
+			break;
+
+		bg_cache = btrfs_lookup_first_block_group(fs_info,
+				bg_cache->key.objectid + bg_cache->key.offset);
+		if (!bg_cache)
+			break;
+	}
+
+	printf("Scrub result:\n");
+	printf("Tree bytes scrubbed: %llu\n", scrub_ctx.tree_bytes_scrubbed);
+	printf("Tree extents scrubbed: %llu\n", scrub_ctx.tree_extents_scrubbed);
+	printf("Data bytes scrubbed: %llu\n", scrub_ctx.data_bytes_scrubbed);
+	printf("Data extents scrubbed: %llu\n", scrub_ctx.data_extents_scrubbed);
+	printf("Data bytes without csum: %llu\n", scrub_ctx.csum_discards *
+			fs_info->tree_root->sectorsize);
+	printf("Read error: %llu\n", scrub_ctx.read_errors);
+	printf("Verify error: %llu\n", scrub_ctx.verify_errors);
+	printf("Csum error: %llu\n", scrub_ctx.csum_errors);
+	if (scrub_ctx.csum_errors || scrub_ctx.read_errors ||
+	    scrub_ctx.uncorrectable_errors || scrub_ctx.verify_errors)
+		ret = 1;
+	else
+		ret = 0;
+	return ret;
+}
-- 
2.11.0




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/19] Btrfs offline scrub
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (18 preceding siblings ...)
  2016-12-26  6:29 ` [PATCH v2 19/19] btrfs-progs: fsck: Introduce offline scrub function Qu Wenruo
@ 2016-12-26  8:42 ` Qu Wenruo
  2016-12-29 18:15 ` [PATCH v2 00/19] Goffredo Baroncelli
  20 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2016-12-26  8:42 UTC (permalink / raw)
  To: linux-btrfs

Errr, I just forgot the title.

It's "btrfs offline scrub".

Thanks,
Qu

At 12/26/2016 02:29 PM, Qu Wenruo wrote:
> [cover letter snipped]



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/19]
  2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
                   ` (19 preceding siblings ...)
  2016-12-26  8:42 ` [PATCH v2 00/19] Btrfs offline scrub Qu Wenruo
@ 2016-12-29 18:15 ` Goffredo Baroncelli
  2016-12-30  0:40   ` Qu Wenruo
  20 siblings, 1 reply; 27+ messages in thread
From: Goffredo Baroncelli @ 2016-12-29 18:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

I tried your patch because I had a hardware failure and needed to check the data integrity. I didn't find any problems; however, I was not able to understand what "btrfs check --scrub" was doing because the program didn't give any output (there is no progress bar). So I straced it to check whether the program was working properly. The strace output showed me that the program ran correctly.
However, from the strace I noticed that the program reads the same page (size 16k) several times. I think that this is due to the walking of the btree. This could be a possible optimization: cache the last read(s).

Only my 2¢

BR
G.Baroncelli



On 2016-12-26 07:29, Qu Wenruo wrote:
> [cover letter snipped]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/19]
  2016-12-29 18:15 ` [PATCH v2 00/19] Goffredo Baroncelli
@ 2016-12-30  0:40   ` Qu Wenruo
  2016-12-30 18:39     ` Goffredo Baroncelli
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2016-12-30  0:40 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

Hi Goffredo,

At 12/30/2016 02:15 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> I tried your patch because I had a hardware failure and needed to check the data integrity.

I'm glad the function helps.

> I didn't find any problems; however, I was not able to understand what "btrfs check --scrub" was doing because the program didn't give any output (there is no progress bar).

Right, I should add a progress bar to it.
Maybe in the next version, along with the repair function.

The good thing is, no output means everything is good, just like a normal fsck.

> So I straced it to check whether the program was working properly. The strace output showed me that the program ran correctly.
> However, from the strace I noticed that the program reads the same page (size 16k) several times.
> I think that this is due to the walking of the btree. This could be a possible optimization: cache the last read(s).

That doesn't mean it's scrubbing the same leaf; it's just normal tree searching.

The leaf would be the extent root or nodes near the extent root.
The offline scrub relies heavily on the extent tree to determine whether
there is any extent that needs to be scrubbed.

Furthermore, caching the extent tree is not really that easy,
according to what we have learned from btrfsck.
(The cache may go out of control and explode your RAM.)


But your idea to cache still makes sense; for block devices, a cache would
always be good.
(For normal files, the kernel provides the page cache, so we don't need to
implement one ourselves.)
That may need to be implemented in the ctree operation code rather than in
the offline scrub itself, though.
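
Just to illustrate the idea (these names are made up, nothing like this
exists in btrfs-progs yet, and a real version would also need invalidation
on writes), a minimal single-entry cache at the read level could look like:

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct last_read_cache {
	uint64_t bytenr;	/* logical address of the cached block */
	uint32_t len;		/* number of valid bytes in buf, 0 = empty */
	char buf[65536];	/* large enough for the biggest nodesize */
};

/* Read @len bytes at @physical, serving a repeated read of the same
 * logical block from the cache instead of hitting the disk again. */
static int cached_pread(struct last_read_cache *cache, int fd,
			uint64_t bytenr, uint64_t physical,
			char *dest, uint32_t len)
{
	if (len > sizeof(cache->buf))
		return -EINVAL;
	if (cache->len == len && cache->bytenr == bytenr) {
		memcpy(dest, cache->buf, len);
		return 0;		/* cache hit, no disk I/O */
	}
	if (pread(fd, dest, len, (off_t)physical) < (ssize_t)len)
		return -EIO;
	cache->bytenr = bytenr;
	cache->len = len;
	memcpy(cache->buf, dest, len);
	return 0;
}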

BTW, just for reference, what's your device size and how much time does
the offline scrub take?

Thanks,
Qu




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/19]
  2016-12-30  0:40   ` Qu Wenruo
@ 2016-12-30 18:39     ` Goffredo Baroncelli
  2017-01-03  0:25       ` Qu Wenruo
  0 siblings, 1 reply; 27+ messages in thread
From: Goffredo Baroncelli @ 2016-12-30 18:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 2016-12-30 01:40, Qu Wenruo wrote:
> Hi Goffredo,
[...]
>> So I tried to strace it to check if the program was working properly. The strace output showed me that the program ran correctly.
>> However form the strace I noticed that the program read several time the same page (size 16k).
>> I think that this is due to the  walking of the btree. However this could be a possible optimization: cache the last read(s).
> 
> That doesn't mean it's scrubbing the same leaf, but just normal tree search.
> 
> The leaf would be extent root or nodes near extent root.
> The offline scrub heavily rely on extent tree to determine if there is any extent need to be scrubbed.
> 
> Further more, the idea to cache extent tree is not really that easy, according to what we have learned from btrfsck.
> (Cache may go out of control to explode your RAM).

Let me explain better: what I saw were several *sequential* reads of the *same block*. This is an excerpt of what I saw:

[...]
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
[...]
 
Both the offset and the size are equal. When I wrote about a cache, I was referring to something quite elementary, like caching the last 3 or 4 reads, which could improve the speed a lot.
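
Just to make the idea concrete, here is a minimal sketch of such a cache
(hypothetical code, not something the patchset implements; cached_pread()
and the slot layout are made up for illustration):

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CACHE_SLOTS	4	/* remember only the last few reads */
#define BLOCK_SIZE	16384	/* nodesize seen in the trace above */

struct cache_slot {
	off_t	offset;
	int	valid;
	char	data[BLOCK_SIZE];
};

static struct cache_slot read_cache[CACHE_SLOTS];
static int next_victim;

/*
 * pread() wrapper that serves repeated 16KiB reads from memory.
 * Single fd, fixed block size only; just enough to show the idea.
 */
ssize_t cached_pread(int fd, void *buf, size_t len, off_t offset)
{
	ssize_t ret;
	int i;

	if (len != BLOCK_SIZE)
		return pread(fd, buf, len, offset);

	for (i = 0; i < CACHE_SLOTS; i++) {
		if (read_cache[i].valid && read_cache[i].offset == offset) {
			memcpy(buf, read_cache[i].data, len);
			return len;
		}
	}

	ret = pread(fd, buf, len, offset);
	if (ret == (ssize_t)len) {
		/* round-robin replacement of the oldest slot */
		read_cache[next_victim].offset = offset;
		memcpy(read_cache[next_victim].data, buf, len);
		read_cache[next_victim].valid = 1;
		next_victim = (next_victim + 1) % CACHE_SLOTS;
	}
	return ret;
}

Even a handful of slots would absorb the ping-pong between the two
blocks shown above.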

> 
> 
> But your idea to cache still makes sense, for block-device, cache would always be good.
> (For normal file, kernel provides cache so we don't need to implement by ourself)
> Although that may need to be implemented in the ctree operation code instead of the offline scrub.
> 
> BTW, just for reference, what's your device size and how much time it takes to do the offline scrub?

The disk is a 128GB SSD, of which 25GB are occupied. I retested the scrub command in order to give you some data:

root@venice:/home/ghigo/btrfs/offline_scrub# echo 3 >/proc/sys/vm/drop_caches
root@venice:/home/ghigo/btrfs/offline_scrub# time ./btrfs check --scrub /dev/sda3
Scrub result:
Tree bytes scrubbed: 1108819968
Tree extents scrubbed: 135354
Data bytes scrubbed: 52708061184
Data extents scrubbed: 735767
Data bytes without csum: 235622400
Read error: 0
Verify error: 0
Csum error: 0

real    3m37.889s
user    1m43.060s
sys     0m39.416s

Instead, the kernel scrub requires:

root@venice:~# echo 3 >/proc/sys/vm/drop_caches
root@venice:~# time btrfs scrub start -rB /
scrub done for 931863a5-e0ab-4d90-aeae-af83e096bb64
        scrub started at Fri Dec 30 19:31:08 2016 and finished after 00:01:48
        total bytes scrubbed: 25.69GiB with 0 errors

real    1m48.171s
user    0m0.000s
sys     0m16.864s




Moreover, I have to explain a little trick which I used. Because this was my root filesystem, and I was too lazy to boot from another disk, I switched to single-user mode (systemctl isolate runlevel1.target), remounted the root filesystem read-only (mount -o remount,ro /), and then checked the disk (btrfs check --scrub). I have to point out that I removed some checks from btrfs, because it complained that:
a) the filesystem was mounted (but read-only it should be safe)
b) it was not able to open the device in exclusive mode
To bypass these checks I made the following changes:

diff --git a/cmds-check.c b/cmds-check.c
index 3a16a1f..fe5dee8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12589,7 +12589,7 @@ int cmd_check(int argc, char **argv)
        int qgroup_report = 0;
        int qgroups_repaired = 0;
        int scrub = 0;
-       unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
+       unsigned ctree_flags = 0; /*OPEN_CTREE_EXCLUSIVE;*/
 
        while(1) {
                int c;
@@ -12735,7 +12735,7 @@ int cmd_check(int argc, char **argv)
        radix_tree_init();
        cache_tree_init(&root_cache);
 
-       if((ret = check_mounted(argv[optind])) < 0) {
+/*     if((ret = check_mounted(argv[optind])) < 0) {
                error("could not check mount status: %s", strerror(-ret));
                err |= !!ret;
                goto err_out;
@@ -12745,13 +12745,16 @@ int cmd_check(int argc, char **argv)
                err |= !!ret;
                goto err_out;
        }
-
+*/
+       ret = 0;
        /* only allow partial opening under repair mode */
        if (repair)
                ctree_flags |= OPEN_CTREE_PARTIAL;


Finally, I switched back to the normal state (systemctl isolate graphical.target).

> 
> Thanks,
> Qu

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/19]
  2016-12-30 18:39     ` Goffredo Baroncelli
@ 2017-01-03  0:25       ` Qu Wenruo
  0 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2017-01-03  0:25 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs



At 12/31/2016 02:39 AM, Goffredo Baroncelli wrote:
> On 2016-12-30 01:40, Qu Wenruo wrote:
>> Hi Goffredo,
> [...]
>>> So I tried to strace it to check if the program was working properly. The strace output showed me that the program ran correctly.
>>> However form the strace I noticed that the program read several time the same page (size 16k).
>>> I think that this is due to the  walking of the btree. However this could be a possible optimization: cache the last read(s).
>>
>> That doesn't mean it's scrubbing the same leaf, but just normal tree search.
>>
>> The leaf would be extent root or nodes near extent root.
>> The offline scrub heavily rely on extent tree to determine if there is any extent need to be scrubbed.
>>
>> Further more, the idea to cache extent tree is not really that easy, according to what we have learned from btrfsck.
>> (Cache may go out of control to explode your RAM).
>
> Let me to explain better; what I saw is several *sequential* reads of the *same block*: this is an excerpt of what I saw
>
> [...]
> read64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> pread64(4, "\362<\t\357\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727066112) = 16384
> pread64(4, "\374\4\212\321\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 9727606784) = 16384
> [...]
>
> Where both the offset and the size are equal. When I wrote about a cache, I am referring to something quite elementary like caching the last 3/4 reads, which could improve a lot the speed.

That seems to be the csum tree.
Since we are checking csums at sector granularity and I haven't
implemented any speedup yet, it will read the csum tree again and again,
which I think is part of the cause of the slowness.

And since the offline scrub function itself currently seems to work
quite well, it's time to enhance the speed and the UI.

I'll address them all in the next version.

Thanks,
Qu




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.
  2016-12-26  6:29 ` [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result Qu Wenruo
@ 2017-02-24  0:37   ` Liu Bo
  2017-02-24  0:45     ` Qu Wenruo
  0 siblings, 1 reply; 27+ messages in thread
From: Liu Bo @ 2017-02-24  0:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Dec 26, 2016 at 02:29:26PM +0800, Qu Wenruo wrote:
> Introduce a new function, __btrfs_map_block_v2().
> 
> Unlike old btrfs_map_block(), which needs different parameter to handle
> different RAID profile, this new function uses unified btrfs_map_block
> structure to handle all RAID profile in a more meaningful method:
> 
> Return physical address along with logical address for each stripe.
> 
> For RAID1/Single/DUP (none-stripped):
> result would be like:
> Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
> Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
> Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2
> 
> Result will be as long as possible, since it's not stripped at all.
> 
> For RAID0/10 (stripped without parity):
> Result will be aligned to full stripe size:
> Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
> Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
> Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
> Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
> Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4
> 
> For RAID5/6 (stripped with parity and dev-rotation)
> Result will be aligned to full stripe size:
> Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
> Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
> Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
> Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
> Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4
> 
> The new unified layout should be very flex and can even handle things
> like N-way RAID1 (which old mirror_num basic one can't handle well).
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  volumes.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  volumes.h |  49 +++++++++++++++++
>  2 files changed, 230 insertions(+)
> 
> diff --git a/volumes.c b/volumes.c
> index f17bdeed..11d1f0e8 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -1593,6 +1593,187 @@ out:
>  	return 0;
>  }
>  
> +static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
> +{
> +	struct btrfs_map_block *ret;
> +	int size;
> +
> +	size = sizeof(struct btrfs_map_stripe) * num_stripes +
> +		sizeof(struct btrfs_map_block);
> +	ret = malloc(size);
> +	if (!ret)
> +		return NULL;
> +	memset(ret, 0, size);
> +	return ret;
> +}
> +
> +static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
> +			       struct btrfs_map_block *map_block)
> +{
> +	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +	u64 bg_start = map->ce.start;
> +	u64 bg_end = bg_start + map->ce.size;
> +	u64 bg_offset = start - bg_start; /* offset inside the block group */
> +	u64 fstripe_logical = 0;	/* Full stripe start logical bytenr */
> +	u64 fstripe_size = 0;		/* Full stripe logical size */
> +	u64 fstripe_phy_off = 0;	/* Full stripe offset in each dev */
> +	u32 stripe_len = map->stripe_len;
> +	int sub_stripes = map->sub_stripes;
> +	int data_stripes = nr_data_stripes(map);
> +	int dev_rotation;
> +	int i;
> +
> +	map_block->num_stripes = map->num_stripes;
> +	map_block->type = profile;
> +
> +	/*
> +	 * Common full stripe data for stripe based profiles
> +	 */
> +	if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
> +		       BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
> +		fstripe_size = stripe_len * data_stripes;
> +		if (sub_stripes)
> +			fstripe_size /= sub_stripes;
> +		fstripe_logical = bg_offset / fstripe_size * fstripe_size +
> +				    bg_start;
> +		fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
> +	}
> +
> +	switch (profile) {
> +	case BTRFS_BLOCK_GROUP_DUP:
> +	case BTRFS_BLOCK_GROUP_RAID1:
> +	case 0: /* SINGLE */
> +		/*
> +		 * None-stripe mode,(Single, DUP and RAID1)
> +		 * Just use offset to fill map_block
> +		 */
> +		map_block->stripe_len = 0;
> +		map_block->start = start;
> +		map_block->length = min(bg_end, start + length) - start;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			struct btrfs_map_stripe *stripe;
> +
> +			stripe = &map_block->stripes[i];
> +
> +			stripe->dev = map->stripes[i].dev;
> +			stripe->logical = start;
> +			stripe->physical = map->stripes[i].physical + bg_offset;
> +			stripe->length = map_block->length;
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID10:
> +	case BTRFS_BLOCK_GROUP_RAID0:
> +		/*
> +		 * Stripe modes without parity(0 and 10)
> +		 * Return the whole full stripe
> +		 */
> +
> +		map_block->start = fstripe_logical;
> +		map_block->length = fstripe_size;
> +		map_block->stripe_len = map->stripe_len;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			struct btrfs_map_stripe *stripe;
> +			u64 cur_offset;
> +
> +			/* Handle RAID10 sub stripes */
> +			if (sub_stripes)
> +				cur_offset = i / sub_stripes * stripe_len;
> +			else
> +				cur_offset = stripe_len * i;
> +			stripe = &map_block->stripes[i];
> +
> +			stripe->dev = map->stripes[i].dev;
> +			stripe->logical = fstripe_logical + cur_offset;
> +			stripe->length = stripe_len;
> +			stripe->physical = map->stripes[i].physical +
> +					   fstripe_phy_off;

Looks like @fstripe_phy_off refers to the start offset of the full stripe
on each device, but we may ask for an offset inside the stripe.

Thanks,

-liubo

> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID5:
> +	case BTRFS_BLOCK_GROUP_RAID6:
> +		/*
> +		 * Stripe modes with parity and device rotation(5 and 6)
> +		 *
> +		 * Return the whole full stripe
> +		 */
> +
> +		dev_rotation = (bg_offset / fstripe_size) % map->num_stripes;
> +
> +		map_block->start = fstripe_logical;
> +		map_block->length = fstripe_size;
> +		map_block->stripe_len = map->stripe_len;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			struct btrfs_map_stripe *stripe;
> +			int dest_index;
> +			u64 cur_offset = stripe_len * i;
> +
> +			stripe = &map_block->stripes[i];
> +
> +			dest_index = (i + dev_rotation) % map->num_stripes;
> +			stripe->dev = map->stripes[dest_index].dev;
> +			stripe->length = stripe_len;
> +			stripe->physical = map->stripes[dest_index].physical +
> +					   fstripe_phy_off;
> +			if (i < data_stripes) {
> +				/* data stripe */
> +				stripe->logical = fstripe_logical +
> +						  cur_offset;
> +			} else if (i == data_stripes) {
> +				/* P */
> +				stripe->logical = BTRFS_RAID5_P_STRIPE;
> +			} else {
> +				/* Q */
> +				stripe->logical = BTRFS_RAID6_Q_STRIPE;
> +			}
> +		}
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
> +			 u64 length, struct btrfs_map_block **map_ret)
> +{
> +	struct cache_extent *ce;
> +	struct map_lookup *map;
> +	struct btrfs_map_block *map_block;
> +	int ret;
> +
> +	/* Eearly parameter check */
> +	if (!length || !map_ret) {
> +		error("wrong parameter for %s", __func__);
> +		return -EINVAL;
> +	}
> +
> +	ce = search_cache_extent(&fs_info->mapping_tree.cache_tree, logical);
> +	if (!ce)
> +		return -ENOENT;
> +	if (ce->start > logical)
> +		return -ENOENT;
> +
> +	map = container_of(ce, struct map_lookup, ce);
> +	/*
> +	 * Allocate a full map_block anyway
> +	 *
> +	 * For write, we need the full map_block anyway.
> +	 * For read, it will be striped to the needed stripe before returning.
> +	 */
> +	map_block = alloc_map_block(map->num_stripes);
> +	if (!map_block)
> +		return -ENOMEM;
> +	ret = fill_full_map_block(map, logical, length, map_block);
> +	if (ret < 0) {
> +		free(map_block);
> +		return ret;
> +	}
> +	/* TODO: Remove unrelated map_stripes for READ operation */
> +
> +	*map_ret = map_block;
> +	return 0;
> +}
> +
>  struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
>  				       u8 *uuid, u8 *fsid)
>  {
> diff --git a/volumes.h b/volumes.h
> index ee7d56ab..0a575557 100644
> --- a/volumes.h
> +++ b/volumes.h
> @@ -108,6 +108,51 @@ struct map_lookup {
>  	struct btrfs_bio_stripe stripes[];
>  };
>  
> +struct btrfs_map_stripe {
> +	struct btrfs_device *dev;
> +
> +	/*
> +	 * Logical address of the stripe start.
> +	 * Caller should check if this logical is the desired map start.
> +	 * It's possible that the logical is smaller or larger than desired
> +	 * map range.
> +	 *
> +	 * For P/Q stipre, it will be BTRFS_RAID5_P_STRIPE
> +	 * and BTRFS_RAID6_Q_STRIPE.
> +	 */
> +	u64 logical;
> +
> +	u64 physical;
> +
> +	/* The length of the stripe */
> +	u64 length;
> +};
> +
> +struct btrfs_map_block {
> +	/*
> +	 * The logical start of the whole map block.
> +	 * For RAID5/6 it will be the bytenr of the full stripe start,
> +	 * so it's possible that @start is smaller than desired map range
> +	 * start.
> +	 */
> +	u64 start;
> +
> +	/*
> +	 * The logical length of the map block.
> +	 * For RAID5/6 it will be total data stripe size
> +	 */
> +	u64 length;
> +
> +	/* Block group type */
> +	u64 type;
> +
> +	/* Stripe length, for non-stripped mode, it will be 0 */
> +	u32 stripe_len;
> +
> +	int num_stripes;
> +	struct btrfs_map_stripe stripes[];
> +};
> +
>  #define btrfs_multi_bio_size(n) (sizeof(struct btrfs_multi_bio) + \
>  			    (sizeof(struct btrfs_bio_stripe) * (n)))
>  #define btrfs_map_lookup_size(n) (sizeof(struct map_lookup) + \
> @@ -187,6 +232,10 @@ int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
>  		    u64 logical, u64 *length,
>  		    struct btrfs_multi_bio **multi_ret, int mirror_num,
>  		    u64 **raid_map_ret);
> +
> +/* TODO: Use this map_block_v2 to replace __btrfs_map_block() */
> +int __btrfs_map_block_v2(struct btrfs_fs_info *fs_info, int rw, u64 logical,
> +			 u64 length, struct btrfs_map_block **map_ret);
>  int btrfs_next_bg(struct btrfs_mapping_tree *map_tree, u64 *logical,
>  		     u64 *size, u64 type);
>  static inline int btrfs_next_bg_metadata(struct btrfs_mapping_tree *map_tree,
> -- 
> 2.11.0
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.
  2017-02-24  0:37   ` Liu Bo
@ 2017-02-24  0:45     ` Qu Wenruo
  0 siblings, 0 replies; 27+ messages in thread
From: Qu Wenruo @ 2017-02-24  0:45 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



At 02/24/2017 08:37 AM, Liu Bo wrote:
> On Mon, Dec 26, 2016 at 02:29:26PM +0800, Qu Wenruo wrote:
>> Introduce a new function, __btrfs_map_block_v2().
>>
>> Unlike old btrfs_map_block(), which needs different parameter to handle
>> different RAID profile, this new function uses unified btrfs_map_block
>> structure to handle all RAID profile in a more meaningful method:
>>
>> Return physical address along with logical address for each stripe.
>>
>> For RAID1/Single/DUP (none-stripped):
>> result would be like:
>> Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
>> Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
>> Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2
>>
>> Result will be as long as possible, since it's not stripped at all.
>>
>> For RAID0/10 (stripped without parity):
>> Result will be aligned to full stripe size:
>> Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
>> Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
>> Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
>> Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
>> Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4
>>
>> For RAID5/6 (stripped with parity and dev-rotation)
>> Result will be aligned to full stripe size:
>> Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
>> Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
>> Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
>> Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
>> Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4
>>
>> The new unified layout should be very flex and can even handle things
>> like N-way RAID1 (which old mirror_num basic one can't handle well).
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  volumes.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  volumes.h |  49 +++++++++++++++++
>>  2 files changed, 230 insertions(+)
>>
>> diff --git a/volumes.c b/volumes.c
>> index f17bdeed..11d1f0e8 100644
>> --- a/volumes.c
>> +++ b/volumes.c
>> @@ -1593,6 +1593,187 @@ out:
>>  	return 0;
>>  }
>>
>> +static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
>> +{
>> +	struct btrfs_map_block *ret;
>> +	int size;
>> +
>> +	size = sizeof(struct btrfs_map_stripe) * num_stripes +
>> +		sizeof(struct btrfs_map_block);
>> +	ret = malloc(size);
>> +	if (!ret)
>> +		return NULL;
>> +	memset(ret, 0, size);
>> +	return ret;
>> +}
>> +
>> +static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
>> +			       struct btrfs_map_block *map_block)
>> +{
>> +	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
>> +	u64 bg_start = map->ce.start;
>> +	u64 bg_end = bg_start + map->ce.size;
>> +	u64 bg_offset = start - bg_start; /* offset inside the block group */
>> +	u64 fstripe_logical = 0;	/* Full stripe start logical bytenr */
>> +	u64 fstripe_size = 0;		/* Full stripe logical size */
>> +	u64 fstripe_phy_off = 0;	/* Full stripe offset in each dev */
>> +	u32 stripe_len = map->stripe_len;
>> +	int sub_stripes = map->sub_stripes;
>> +	int data_stripes = nr_data_stripes(map);
>> +	int dev_rotation;
>> +	int i;
>> +
>> +	map_block->num_stripes = map->num_stripes;
>> +	map_block->type = profile;
>> +
>> +	/*
>> +	 * Common full stripe data for stripe based profiles
>> +	 */
>> +	if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
>> +		       BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
>> +		fstripe_size = stripe_len * data_stripes;
>> +		if (sub_stripes)
>> +			fstripe_size /= sub_stripes;
>> +		fstripe_logical = bg_offset / fstripe_size * fstripe_size +
>> +				    bg_start;
>> +		fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
>> +	}
>> +
>> +	switch (profile) {
>> +	case BTRFS_BLOCK_GROUP_DUP:
>> +	case BTRFS_BLOCK_GROUP_RAID1:
>> +	case 0: /* SINGLE */
>> +		/*
>> +		 * None-stripe mode,(Single, DUP and RAID1)
>> +		 * Just use offset to fill map_block
>> +		 */
>> +		map_block->stripe_len = 0;
>> +		map_block->start = start;
>> +		map_block->length = min(bg_end, start + length) - start;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			struct btrfs_map_stripe *stripe;
>> +
>> +			stripe = &map_block->stripes[i];
>> +
>> +			stripe->dev = map->stripes[i].dev;
>> +			stripe->logical = start;
>> +			stripe->physical = map->stripes[i].physical + bg_offset;
>> +			stripe->length = map_block->length;
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID10:
>> +	case BTRFS_BLOCK_GROUP_RAID0:
>> +		/*
>> +		 * Stripe modes without parity(0 and 10)
>> +		 * Return the whole full stripe
>> +		 */
>> +
>> +		map_block->start = fstripe_logical;
>> +		map_block->length = fstripe_size;
>> +		map_block->stripe_len = map->stripe_len;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			struct btrfs_map_stripe *stripe;
>> +			u64 cur_offset;
>> +
>> +			/* Handle RAID10 sub stripes */
>> +			if (sub_stripes)
>> +				cur_offset = i / sub_stripes * stripe_len;
>> +			else
>> +				cur_offset = stripe_len * i;
>> +			stripe = &map_block->stripes[i];
>> +
>> +			stripe->dev = map->stripes[i].dev;
>> +			stripe->logical = fstripe_logical + cur_offset;
>> +			stripe->length = stripe_len;
>> +			stripe->physical = map->stripes[i].physical +
>> +					   fstripe_phy_off;
>
> Looks like @fstripe_phy_off refers to the start offset of the stripe on devices,
> but we may ask for an offset inside the stripe.

Yes, that's by design, to make __btrfs_map_block_v2() itself only care
about stripe boundaries and keep it simple.

And in the next patch, I introduce an easy function to trim the stripes
to the desired range and remove unrelated stripes.
Thanks to the new stripe structure, which has both the physical and the
logical address, we don't need to introduce that complex logic in
__btrfs_map_block_v2().
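
To put numbers on it (an illustrative example with made-up values, not
taken from the patchset): assume a 3-device RAID5 block group with
stripe_len = 64K, so data_stripes = 2 and fstripe_size = 128K. For a
request at bg_offset = 200K the code above computes:

    fstripe_logical = 200K / 128K * 128K + bg_start = bg_start + 128K
    fstripe_phy_off = 200K / 128K * 64K             = 64K
    dev_rotation    = (200K / 128K) % 3             = 1

So the returned map block covers the whole 128K full stripe, starting
72K before the requested logical (the request falls 8K into the second
data stripe), and each stripe's physical address points at the full
stripe start on its device, not at the requested offset, which is
exactly the gap the trimming helper mentioned above is meant to close.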

Thanks,
Qu



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-02-24  0:45 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-26  6:29 [PATCH v2 00/19] Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 01/19] btrfs-progs: raid56: Introduce raid56 header for later recovery usage Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 02/19] btrfs-progs: raid56: Introduce tables for RAID6 recovery Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 03/19] btrfs-progs: raid56: Allow raid6 to recover 2 data stripes Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 04/19] btrfs-progs: raid56: Allow raid6 to recover data and p Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 05/19] btrfs-progs: Introduce wrapper to recover raid56 data Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 06/19] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result Qu Wenruo
2017-02-24  0:37   ` Liu Bo
2017-02-24  0:45     ` Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 07/19] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 08/19] btrfs-progs: csum: Introduce function to read out one data csum Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 09/19] btrfs-progs: scrub: Introduce structures to support fsck scrub for RAID56 Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 10/19] btrfs-progs: scrub: Introduce function to scrub mirror based tree block Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 11/19] btrfs-progs: scrub: Introduce function to scrub mirror based data blocks Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 12/19] btrfs-progs: scrub: Introduce function to scrub one extent Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 13/19] btrfs-progs: scrub: Introduce function to scrub one data stripe Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 14/19] btrfs-progs: scrub: Introduce function to verify parities Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 15/19] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 16/19] btrfs-progs: scrub: Introduce function to recover data parity Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 17/19] btrfs-progs: scrub: Introduce a function to scrub one full stripe Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 18/19] btrfs-progs: scrub: Introduce function to check a whole block group Qu Wenruo
2016-12-26  6:29 ` [PATCH v2 19/19] btrfs-progs: fsck: Introduce offline scrub function Qu Wenruo
2016-12-26  8:42 ` [PATCH v2 00/19] Btrfs offline scrub Qu Wenruo
2016-12-29 18:15 ` [PATCH v2 00/19] Goffredo Baroncelli
2016-12-30  0:40   ` Qu Wenruo
2016-12-30 18:39     ` Goffredo Baroncelli
2017-01-03  0:25       ` Qu Wenruo
