* [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'
@ 2016-07-27 17:43 Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 1/5] Add some helper functions Goffredo Baroncelli
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo


Hi all,

the following patches add two new commands: 
1) btrfs inspect-internal physical-find
2) btrfs inspect-internal physical-dump

The aim of these two new commands is to locate (1) and dump (2) the stripe elements
stored on the disks. I developed these two new commands to simplify the
debugging of some RAID5 bugs (but this is another discussion).

An example of 'btrfs inspect-internal physical-find' is the following:

# btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

In the output above, DATA is the stripe element containing the data. OTHER
is the sibling stripe element: it contains data belonging either to other
files or to the same file at a different position. The two PARITY stripe
elements contain the RAID6 parity (P and Q).

An offset within the file can also be passed as a second argument.

An example of 'btrfs inspect-internal physical-dump' is the following:

# btrfs insp physical-find mnt/out.txt 
mnt/out.txt: 0
devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY

# btrfs insp physical-dump mnt/out.txt | xxd 
mnt/out.txt: 0
file: /dev/loop2 off=56819712
00000000: 6164 6161 6161 6161 6161 6161 6161 6161  adaaaaaaaaaaaaaa
00000010: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
00000020: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
00000030: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
00000040: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
[...]

In this case the first 4k of the file is dumped. It is also possible to
pass an offset (in steps of 4k). Moreover, it is possible to select what
to dump: which copy (switch -c, only for RAID1/RAID10/DUP); which parity
(switch -p, only for RAID5/RAID6); which stripe element other than the
data one (switch -s, only for RAID5/RAID6).

ChangeLog:

v1: 2016-07-24 First issue
v2: 2016-07-27 Following Qu's suggestion, added the switch '-l' to dump
the info from a "logical" address

BR
G.Baroncelli






* [PATCH 1/5] Add some helper functions
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
@ 2016-07-27 17:43 ` Goffredo Baroncelli
  2016-07-28  1:03   ` Qu Wenruo
  2016-07-27 17:43 ` [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" Goffredo Baroncelli
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following functions:
- int is_btrfs_fs(const char *path) -> returns 0 if path is on a btrfs
                               filesystem, a negative value otherwise
- void check_root_or_exit() -> checks if the user is root (effective uid 0),
                               otherwise it exits printing an error message
- void check_btrfs_or_exit(const char *path) -> checks if path is a valid
                               btrfs filesystem, otherwise it exits printing
                               an error message

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 utils.c | 41 +++++++++++++++++++++++++++++++++++++++++
 utils.h | 14 ++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/utils.c b/utils.c
index 578fdb0..b99706c 100644
--- a/utils.c
+++ b/utils.c
@@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper)
 	 */
 	return (unsigned int)(jrand48(rand_seed) % upper);
 }
+
+/*
+ * check if path is a btrfs filesystem
+ */
+int is_btrfs_fs(const char *path)
+{
+	struct statfs stfs;
+
+	if (statfs(path, &stfs) != 0) {
+		/* cannot access */
+		return -1;
+	}
+
+	if (stfs.f_type != BTRFS_SUPER_MAGIC) {
+		/* not a btrfs filesystem */
+		return -2;
+	}
+
+	return 0;
+}
+
+/*
+ * check if the user is root
+ */
+void check_root_or_exit(void)
+{
+	if (geteuid() == 0)
+		return;
+
+	error("You need to be root to execute this command");
+	exit(100);
+}
+
+void check_btrfs_or_exit(const char *path)
+{
+	if (!is_btrfs_fs(path))
+		return;
+
+	error("'%s' must be a valid btrfs filesystem", path);
+	exit(100);
+}
diff --git a/utils.h b/utils.h
index 98bfb34..0bd6ecb 100644
--- a/utils.h
+++ b/utils.h
@@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper);
 /* Also allow setting the seed manually */
 void init_rand_seed(u64 seed);
 
+/* return 0 if path is a valid btrfs filesystem */
+int is_btrfs_fs(const char *path);
+
+/*
+ * check if the user has the root capability, otherwise it exits printing an
+ * error message
+ */
+void check_root_or_exit(void);
+/*
+ * check if path is a valid btrfs filesystem, otherwise it exits printing an
+ * error message
+ */
+void check_btrfs_or_exit(const char *path);
+
 #endif
-- 
2.8.1



* [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 1/5] Add some helper functions Goffredo Baroncelli
@ 2016-07-27 17:43 ` Goffredo Baroncelli
  2016-07-28  1:47   ` Qu Wenruo
  2016-07-27 17:43 ` [PATCH 3/5] new command btrfs inspect physical-dump Goffredo Baroncelli
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The aim of this new command is to show the physical placement on the disk
of a file.
Currently it handles all the profiles (single, dup, raid1/10/5/6).

The syntax is simple:

  btrfs inspect-internal physical-find <filename> [<offset>]

where:
  <filename> is the file to inspect
  <offset> is the offset of the file to inspect (default 0)

Some examples follow:

** Single

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR
$ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
adaaa
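
The dd check above can also be done from C. What follows is only a sketch
and is not part of this patch: read_exact_at() and content_matches_at() are
hypothetical helper names, and fd is assumed to be a descriptor already
opened on the device reported by physical-find.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly len bytes at byte offset off, retrying short reads.
 * Returns 0 on success, -errno on error, -ENODATA on early end of file.
 */
int read_exact_at(int fd, void *buf, size_t len, off_t off)
{
	char *p = buf;

	assert(fd >= 0 && buf != NULL);
	while (len > 0) {
		ssize_t r = pread(fd, p, len, off);

		if (r < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		if (r == 0)
			return -ENODATA;
		p += r;
		off += r;
		len -= (size_t)r;
	}
	return 0;
}

/* Return 1 if the bytes at off equal expected, 0 if not, < 0 on error. */
int content_matches_at(int fd, off_t off, const char *expected)
{
	char buf[64];
	size_t len = strlen(expected);
	int ret;

	if (len > sizeof(buf))
		return -EINVAL;
	ret = read_exact_at(fd, buf, len, off);
	if (ret < 0) {
		fprintf(stderr, "read at %lld failed: %s\n",
			(long long)off, strerror(-ret));
		return ret;
	}
	return memcmp(buf, expected, len) == 0;
}
```

With the Single example above, content_matches_at(fd, 12582912, "adaaa")
would be expected to return 1 when fd refers to /dev/loop0.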

** Dup

The command shows both the copies

$ sudo mkfs.btrfs -f -d dup -m dup /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
        devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
$ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
adaaa

** Raid1

The command shows both the copies

$ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
        devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
$ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
adaaa

** Raid10

The command shows both copies; if you set the offset to the next disk-stripe,
you can see the next pair of disk-stripes

$ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
adaaa
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
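
The jump from the loop3/loop2 pair to the loop1/loop0 pair can be sketched
in C as below. This only mirrors the arithmetic of the RAID10 branch of
dump_stripes() in this patch; it is not the kernel mapping code, and
raid10_first_slot() is a hypothetical name. chunk_off is the offset inside
the chunk, and the result is an index into the chunk's stripes[] array.

```c
#include <assert.h>
#include <stdint.h>

/*
 * First slot in the chunk's stripes[] array holding a copy of chunk_off;
 * the other copies are in the following sub_stripes - 1 slots.
 */
unsigned raid10_first_slot(uint64_t chunk_off, unsigned num_stripes,
			   unsigned sub_stripes, uint64_t stripe_len)
{
	uint64_t stripe_capacity, stripe_start;

	assert(stripe_len > 0 && sub_stripes > 0 && num_stripes >= sub_stripes);
	/* bytes of file data held by one full row of disk-stripes */
	stripe_capacity = (uint64_t)num_stripes * stripe_len / sub_stripes;
	/* offset inside the current row */
	stripe_start = chunk_off % stripe_capacity;

	return (unsigned)(stripe_start / stripe_len * sub_stripes % num_stripes);
}
```

With four devices, two sub-stripes and a 64k stripe length (the values
assumed for the example above), offset 0 maps to slots 0 and 1, and offset
65536 maps to slots 2 and 3.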

** Raid5

Depending on the offset, you can see which disk-stripe is used.

$ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
adaaa
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
00000000: 0300 0303 03                             .....

The parity is computed as: parity = disk1 ^ disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03"
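
The same XOR can be checked with a few lines of C. This is a minimal sketch
for the two-data-disk case only; xor_parity() is a hypothetical helper and
not code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* XOR two equally sized stripe elements into a parity buffer (RAID5 "P"). */
void xor_parity(const unsigned char *d1, const unsigned char *d2,
		unsigned char *parity, size_t len)
{
	size_t i;

	assert(d1 != NULL && d2 != NULL && parity != NULL);
	for (i = 0; i < len; i++)
		parity[i] = d1[i] ^ d2[i];
}
```

Feeding it the first five bytes read from loop1 ("adaaa") and loop0
("bdbbb") gives 03 00 03 03 03, the bytes that xxd shows for loop2.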

** Raid6

$ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
adaaa
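
How the DATA element moves between devices in the Raid5 and Raid6 examples
can be sketched as follows. The code only restates the rotation used by
dump_stripes() in this patch, under the assumption that this rotation is
correct; it is not taken from the kernel. raid56_data_slot() and
raid56_dev_offset() are hypothetical names, and chunk_off is the offset
inside the chunk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index into the chunk's stripes[] array of the element holding the data
 * at chunk_off: data element i of stripe row stripe_nr lands in slot
 * (i + stripe_nr) % num_stripes.
 */
unsigned raid56_data_slot(uint64_t chunk_off, unsigned num_stripes,
			  unsigned parities, uint64_t stripe_len)
{
	uint64_t stripe_capacity, stripe_nr, stripe_start;

	assert(stripe_len > 0 && num_stripes > parities);
	stripe_capacity = (uint64_t)(num_stripes - parities) * stripe_len;
	stripe_nr = chunk_off / stripe_capacity;	/* stripe row */
	stripe_start = chunk_off % stripe_capacity;	/* offset in the row */

	return (unsigned)((stripe_start / stripe_len + stripe_nr) % num_stripes);
}

/* Byte offset inside the per-device stripe; add stripes[slot].offset to it. */
uint64_t raid56_dev_offset(uint64_t chunk_off, unsigned num_stripes,
			   unsigned parities, uint64_t stripe_len)
{
	uint64_t stripe_capacity;

	assert(stripe_len > 0 && num_stripes > parities);
	stripe_capacity = (uint64_t)(num_stripes - parities) * stripe_len;

	return chunk_off / stripe_capacity * stripe_len + chunk_off % stripe_len;
}
```

For the three-device Raid5 example (one parity, 64k stripe length assumed),
offset 0 gives slot 0 and offset 65536 gives slot 1, which matches DATA
moving from loop1 to loop0 in the output above.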


Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 cmds-inspect.c | 587 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 587 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..fc2e7c3 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -22,6 +22,11 @@
 #include <errno.h>
 #include <getopt.h>
 #include <limits.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <linux/fs.h>
+#include <linux/fiemap.h>
 
 #include "kerncompat.h"
 #include "ioctl.h"
@@ -623,6 +628,586 @@ out:
 	return !!ret;
 }
 
+
+static const char * const cmd_inspect_physical_find_usage[] = {
+	"btrfs inspect-internal physical-find <path> [<off>|-l <logical>]",
+	"Show the physical placement of a file data.",
+	"<path>     file to show",
+	"<off>      file offset to show; 0 if not specified",
+	"<logical>  show info about a logical address instead of a file",
+	"This command requires root privileges",
+	NULL
+};
+
+#define STRIPE_INFO_LINEAR		1
+#define STRIPE_INFO_DUP			2
+#define STRIPE_INFO_RAID0		3
+#define STRIPE_INFO_RAID1		4
+#define STRIPE_INFO_RAID10		5
+#define STRIPE_INFO_RAID56_DATA		6
+#define STRIPE_INFO_RAID56_OTHER	7
+#define STRIPE_INFO_RAID56_PARITY	8
+
+static const char * const stripe_info_descr[] = {
+	[STRIPE_INFO_LINEAR] = "LINEAR",
+	[STRIPE_INFO_DUP] = "DUP",
+	[STRIPE_INFO_RAID0] = "RAID0",
+	[STRIPE_INFO_RAID1] = "RAID1",
+	[STRIPE_INFO_RAID10] = "RAID10",
+	[STRIPE_INFO_RAID56_DATA] = "DATA",
+	[STRIPE_INFO_RAID56_OTHER] = "OTHER",
+	[STRIPE_INFO_RAID56_PARITY] = "PARITY",
+};
+
+struct stripe_info {
+	u64 devid;
+	const char *dname;
+	u64 phy_start;
+	int type;
+};
+
+static void add_stripe_info(struct stripe_info **list, int *count,
+	u64 devid, const char *dname, u64 phy_start, int type) {
+
+	if (*list == NULL)
+		*count = 0;
+
+	++*count;
+	*list = realloc(*list, sizeof(struct stripe_info) * *count);
+	/*
+	 * It is rude, but it should not happen for this kind of allocation...
+	 * ... and anyway when it happens, there are more severe problems
+	 * than this handling of "not enough memory"
+	 */
+	if (*list == NULL) {
+		error("Not enough memory: abort\n");
+		exit(100);
+	}
+
+	(*list)[*count-1].devid = devid;
+	(*list)[*count-1].dname = dname;
+	(*list)[*count-1].phy_start = phy_start;
+	(*list)[*count-1].type = type;
+}
+
+static void dump_stripes(int ndisks, struct btrfs_ioctl_dev_info_args *disks,
+			 struct btrfs_chunk *chunk, u64 logical_start,
+			 struct stripe_info **stripes_ret, int *stripes_count) {
+	struct btrfs_stripe *stripes;
+
+	stripes = &chunk->stripe;
+
+	if ((chunk->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0) {
+		/* LINEAR: each chunk has (should have) only one disk */
+		int j;
+		char *dname = "<NOT FOUND>";
+
+		assert(chunk->num_stripes == 1);
+
+		u64 phy_start = stripes[0].offset +
+			+logical_start;
+		for (j = 0 ; j < ndisks ; j++) {
+			if (stripes[0].devid == disks[j].devid) {
+				dname = (char *)disks[j].path;
+				break;
+			}
+		}
+
+		add_stripe_info(stripes_ret, stripes_count,
+			stripes[0].devid, dname, phy_start,
+			STRIPE_INFO_LINEAR);
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID0) {
+		/*
+		 * RAID0: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABC...NMOP....
+		 *
+		 *      disk1   disk2    disk3  .... disksN
+		 *
+		 *        A      B         C    ....    N
+		 *        M      O         P    ....
+		 *
+		 */
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 disk_stripe_start;
+		int sidx;
+		int j;
+		char *dname = "<NOT FOUND>";
+
+		stripe_capacity = disks_number * disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		sidx = (logical_start / disk_stripe_size) % disks_number;
+
+		u64 phy_start = stripes[sidx].offset +
+			stripe_nr * disk_stripe_size +
+			disk_stripe_start;
+
+		for (j = 0 ; j < ndisks ; j++) {
+			if (stripes[sidx].devid == disks[j].devid) {
+				dname = (char *)disks[j].path;
+				break;
+			}
+		}
+
+		add_stripe_info(stripes_ret, stripes_count,
+			stripes[sidx].devid, dname, phy_start,
+			STRIPE_INFO_RAID0);
+
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
+		/*
+		 * RAID1: each chunk is mirrored on more disks;
+		 * every disk holds a copy of the same data:
+		 *
+		 *  file: ABC...
+		 *
+		 *      disk1   disk2   disk3  ....
+		 *
+		 *        A       A
+		 *        B       B
+		 *        C       C
+		 *
+		 */
+		int sidx;
+
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_RAID1);
+		}
+
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_DUP) {
+		/*
+		 * DUP: each chunk has 'num_stripes' disk_stripe. Each
+		 * disk_stripe has its own copy of data
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2   disk3
+		 *
+		 *        A
+		 *        B
+		 *        C
+		 *      [...]
+		 *        A
+		 *        B
+		 *        C
+		 *
+		 *
+		 * NOTE: the difference between DUP and RAID1 is that
+		 * in RAID1 each disk_stripe is in a different disk, in DUP
+		 * each disk chunk is in the same disk
+		 */
+		int sidx;
+
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_DUP);
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID10) {
+		/*
+		 * RAID10: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2   disk3   disk4
+		 *
+		 *        A      A         B      B
+		 *        C      C         D      D
+		 *
+		 *
+		 */
+		int i;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 disk_stripe_start;
+
+		stripe_capacity = disks_number * disk_stripe_size / chunk->sub_stripes;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (i = 0; i < chunk->sub_stripes; i++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			int sidx = (i +
+				stripe_start/disk_stripe_size*chunk->sub_stripes) %
+				disks_number;
+
+			u64 phy_start = stripes[sidx].offset +
+				+stripe_nr*disk_stripe_size + disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_RAID10);
+
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID5 ||
+			chunk->type & BTRFS_BLOCK_GROUP_RAID6) {
+		/*
+		 * RAID5: each chunk is spread on a different disk; however one
+		 * disk is used for parity
+		 *
+		 *  file: ABCDEFGHIJKL....
+		 *
+		 *      disk1  disk2  disk3  disk4  disk5
+		 *
+		 *        A      B      C      D      P
+		 *        P      E      F      G      H
+		 *        L      P      I      J      K
+		 *
+		 *   Note: P == parity
+		 *
+		 * RAID6: each chunk is spread on a different disk; however two
+		 * disks are used for parity
+		 *
+		 *  file: ABCDEFGHI...
+		 *
+		 *      disk1  disk2  disk3  disk4  disk5
+		 *
+		 *        A      B      C      P      Q
+		 *        Q      D      E      F      P
+		 *        P      Q      G      H      I
+		 *
+		 *   Note: P,Q == parity
+		 *
+		 */
+		int parities_nr = 1;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 pos = 0;
+		u64 disk_stripe_start;
+		int sidx;
+
+		if (chunk->type & BTRFS_BLOCK_GROUP_RAID6)
+			parities_nr = 2;
+
+		stripe_capacity = (disks_number - parities_nr) *
+						disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (sidx = 0; sidx < disks_number ; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 stripe_index = (sidx + stripe_nr) % disks_number;
+			u64 phy_start = stripes[stripe_index].offset + /* chunk start */
+				+ stripe_nr*disk_stripe_size +  /* stripe start */
+				+ disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[stripe_index].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+
+			if (sidx >= (disks_number - parities_nr)) {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_PARITY);
+				continue;
+			}
+
+			if (stripe_start >= pos && stripe_start < (pos+disk_stripe_size)) {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_DATA);
+			} else {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_OTHER);
+			}
+
+			pos += disk_stripe_size;
+		}
+		assert(pos == stripe_capacity);
+	} else {
+		error("Unknown chunk type = 0x%016llx\n", chunk->type);
+		return;
+	}
+
+}
+
+static int get_chunk_offset(int fd, u64 logical_start,
+	struct btrfs_chunk *chunk_ret, u64 *off_ret) {
+
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk = &args.key;
+	struct btrfs_ioctl_search_header sh;
+	unsigned long off = 0;
+	int i;
+
+	memset(&args, 0, sizeof(args));
+	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
+	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_offset = (u64)-1;
+	sk->min_offset = 0;
+	sk->max_transid = (u64)-1;
+
+	while (1) {
+		int ret;
+
+		sk->nr_items = 1;
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		if (ret < 0)
+			return -errno;
+
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			struct btrfs_chunk *item;
+
+			memcpy(&sh, args.buf + off, sizeof(sh));
+			off += sizeof(sh);
+			item = (struct btrfs_chunk *)(args.buf + off);
+			off += sh.len;
+
+			if (logical_start >= sh.offset &&
+			    logical_start < sh.offset+item->length) {
+				memcpy(chunk_ret, item, sh.len);
+				*off_ret = logical_start-sh.offset;
+				return 0;
+			}
+
+			sk->min_objectid = sh.objectid;
+			sk->min_type = sh.type;
+			sk->min_offset = sh.offset;
+		}
+
+		if (sk->min_offset < (u64)-1)
+			sk->min_offset++;
+		else
+			break;
+	}
+
+	return 1; /* not found */
+}
+
+/*
+ * Inline extents are skipped because they do not take data space,
+ * delalloc and unknown are skipped because we do not know how much
+ * space they will use yet.
+ */
+#define SKIP_FLAGS	(FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC| \
+			 FIEMAP_EXTENT_DATA_INLINE)
+
+static int get_file_offset(int fd, u64 file_offset, u64 *logical)
+{
+	char buf[16384];
+	struct fiemap *fiemap = (struct fiemap *)buf;
+	struct fiemap_extent *fm_ext;
+	const int count = (sizeof(buf) - sizeof(*fiemap)) /
+					sizeof(struct fiemap_extent);
+	int last = 0;
+
+
+	memset(fiemap, 0, sizeof(struct fiemap));
+
+	do {
+
+		int rc;
+		int j;
+
+		fiemap->fm_length = ~0ULL;
+		fiemap->fm_extent_count = count;
+		fiemap->fm_flags = FIEMAP_FLAG_SYNC;
+		rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
+		if (rc < 0)
+			return -errno;
+
+		for (j = 0; j < fiemap->fm_mapped_extents; j++) {
+			u32 flags;
+
+			fm_ext = &fiemap->fm_extents[j];
+			flags = fm_ext->fe_flags;
+
+			fiemap->fm_start = (fm_ext->fe_logical +
+					fm_ext->fe_length);
+
+			if (flags & FIEMAP_EXTENT_LAST)
+				last = 1;
+
+			if (flags & SKIP_FLAGS)
+				continue;
+
+			if (file_offset >= fm_ext->fe_logical +
+					   fm_ext->fe_length)
+				continue;
+
+			*logical = fm_ext->fe_physical + file_offset -
+				   fm_ext->fe_logical;
+			return 0;
+		}
+	} while (last == 0);
+
+	return 1;
+}
+static int cmd_inspect_physical_find(int argc, char **argv)
+{
+	int ret = 0;
+	int fd = -1;
+	char *fname;
+	struct btrfs_ioctl_dev_info_args *disks = NULL;
+	struct btrfs_ioctl_fs_info_args fi_args = {0};
+	char btrfs_chunk_data[4096];
+	struct btrfs_chunk *chunk_item = (struct btrfs_chunk *)&btrfs_chunk_data;
+	u64 chunk_offset = 0;
+	struct stripe_info *stripes = NULL;
+	int stripes_count = 0;
+	int i;
+	int rc;
+	const char *logical_arg = NULL;
+	u64 logical = 0ull;
+
+
+	optind = 1;
+	while (1) {
+		int c = getopt(argc, argv, "l:");
+
+		if (c < 0)
+			break;
+
+		switch (c) {
+		case 'l':
+			logical_arg = optarg;
+			break;
+		default:
+			usage(cmd_inspect_physical_find_usage);
+		}
+	}
+
+	if ((logical_arg != NULL && check_argc_exact(argc - optind, 1)) ||
+	    (check_argc_min(argc - optind, 1) || check_argc_max(argc - optind, 2)))
+		usage(cmd_inspect_physical_find_usage);
+
+	fname = argv[optind];
+
+	check_root_or_exit();
+	check_btrfs_or_exit(fname);
+
+	fd = open(fname, O_RDONLY);
+	if (fd < 0) {
+		error("Can't open '%s' for reading\n", fname);
+		ret = -errno;
+		goto out;
+	}
+
+	if (logical_arg == NULL) {
+		u64 file_offset = 0ull;
+
+		if (argc - optind == 2)
+			file_offset = strtoull(argv[optind+1], NULL, 0);
+		ret = get_file_offset(fd, file_offset, &logical);
+		if (ret > 0) {
+			error("Can't find the extent: the file is too short, or the file is stored in a leaf.\n");
+			ret = 10;
+			goto out;
+		} else if (ret < 0) {
+			int e = -ret;
+
+			error("Can't do ioctl() [errno=%d: %s]\n", e, strerror(e));
+			ret = 11;
+			goto out;
+		}
+
+		printf("logical: %llu offset: %llu file: %s\n", logical,
+		       file_offset, fname);
+	} else {
+		logical = strtoull(logical_arg, NULL, 0);
+		printf("logical: %llu filesystem: %s\n", logical, fname);
+	}
+
+	rc = get_fs_info(fname, &fi_args, &disks);
+	if (rc < 0) {
+		error("Cannot get info for the filesystem: maybe it is not a btrfs filesystem?\n");
+		ret = 12;
+		goto out;
+	}
+
+	rc = get_chunk_offset(fd,
+		logical,
+		chunk_item, &chunk_offset);
+	if (rc < 0) {
+		error("cannot perform the search: %s", strerror(-rc));
+		ret = 13;
+		goto out;
+	}
+	if (rc != 0) {
+		error("cannot find chunk\n");
+		ret = 14;
+		goto out;
+	}
+
+	dump_stripes(fi_args.num_devices, disks,
+		     chunk_item, chunk_offset,
+		     &stripes, &stripes_count);
+
+	for (i = 0 ; i < stripes_count ; i++) {
+		printf("devid: %llu dev_name: %s offset: %llu type: %s\n",
+			stripes[i].devid, stripes[i].dname,
+			stripes[i].phy_start,
+			stripe_info_descr[stripes[i].type]);
+	}
+
+out:
+	if (fd != -1)
+		close(fd);
+	if (disks != NULL)
+		free(disks);
+	if (stripes != NULL)
+		free(stripes);
+	return ret;
+}
+
 static const char inspect_cmd_group_info[] =
 "query various internal information";
 
@@ -644,6 +1229,8 @@ const struct cmd_group inspect_cmd_group = {
 				cmd_inspect_dump_super_usage, NULL, 0 },
 		{ "tree-stats", cmd_inspect_tree_stats,
 				cmd_inspect_tree_stats_usage, NULL, 0 },
+		{ "physical-find", cmd_inspect_physical_find,
+				cmd_inspect_physical_find_usage, NULL, 0 },
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.8.1



* [PATCH 3/5] new command btrfs inspect physical-dump
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 1/5] Add some helper functions Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" Goffredo Baroncelli
@ 2016-07-27 17:43 ` Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 4/5] Add man page for command btrfs insp physical-find Goffredo Baroncelli
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The aim of this command is to dump the on-disk content of a file, bypassing the
btrfs filesystem. This could help to test the btrfs filesystem.
The dump size is one page (4k), even if the file is shorter. It is possible
to set an offset for the file portion to read, but this offset must be
a multiple of 4k.
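
The alignment rule can be sketched in C as below. This is not how the patch
parses <off> (it calls strtoull() without checking the result); it is only
an illustration of a stricter check. parse_aligned_offset() and
DUMP_BLOCK_SIZE are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DUMP_BLOCK_SIZE 4096ULL	/* page size assumed by the text above */

/*
 * Parse a decimal, octal or hex offset and require 4k alignment.
 * Returns 0 on success, -EINVAL for malformed or unaligned input,
 * -ERANGE on overflow.
 */
int parse_aligned_offset(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	/* reject empty input and a leading minus sign */
	if (*s == '\0' || *s == '-')
		return -EINVAL;
	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	/* no digits consumed, or trailing garbage such as "12k" */
	if (end == s || *end != '\0')
		return -EINVAL;
	if (v % DUMP_BLOCK_SIZE)
		return -EINVAL;
	*out = v;
	return 0;
}
```

With this sketch, "8192" and "0x1000" are accepted, while "100" and "12k"
are rejected with -EINVAL.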

With the switch -c, it is possible to select which copy will be
dumped (RAID1/RAID10/DUP).
With the switch -p, it is possible to select which parity will
be dumped (RAID5/RAID6).
With the switch -s, it is possible to dump the other element of the
stripe (RAID5/RAID6).

# btrfs insp physical-dump /bin/ls 8192 | xxd
/bin/ls: 8192
file: /dev/sda3 off=16600629248
00000000: b0e2 6100 0000 0000 0700 0000 5200 0000  ..a.........R...
00000010: 0000 0000 0000 0000 b8e2 6100 0000 0000  ..........a.....
00000020: 0700 0000 5300 0000 0000 0000 0000 0000  ....S...........
00000030: c0e2 6100 0000 0000 0700 0000 5400 0000  ..a.........T...
[...]


Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 cmds-inspect.c | 320 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index fc2e7c3..0e7d725 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -1208,6 +1208,324 @@ out:
 	return ret;
 }
 
+static const char * const cmd_inspect_physical_dump_usage[] = {
+	"btrfs inspect-internal physical-dump [-c <copynr>|-s <stripenr>|-p <paritynr>] <path> [-l <logical>|<off>]",
+	"Dump the physical content of a file offset",
+	"<path>      file to dump",
+	"<off>       file offset to dump; 0 if not specified",
+	"<logical>   dump logical address of filesystem, <path>",
+	"<copynr>    number of copy to dump (for raid1,dup/raid10)",
+	"<paritynr>  number of parity to dump (for raid5/raid6)",
+	"<stripenr>  number of stripe element to dump (for raid5/raid6)",
+	"This command requires root privileges",
+	NULL
+};
+
+static int dumpfile(const char *fname, u64 off)
+{
+	int fd = -1;
+	int size = 4096;
+	char buf[size];
+	int r;
+	int e = 0;
+	off_t r1;
+
+	fprintf(stderr, "dev: %s off=%llu\n", fname, off);
+
+	fd = open(fname, O_RDONLY|O_APPEND);
+	if (fd < 0) {
+		int e = errno;
+
+		error("cannot open file: '%s'\n", strerror(e));
+		return -e;
+	}
+
+	r1 = lseek(fd, off, SEEK_SET);
+	if (r1 == (off_t)-1) {
+		e = -errno;
+		error("cannot seek file: '%s'\n", strerror(-e));
+		goto out;
+	}
+
+	while (size) {
+		r = read(fd, buf, size);
+		if (r <= 0) {
+			/* treat an unexpected end of device as an I/O error */
+			e = r < 0 ? -errno : -EIO;
+			error("cannot read file: '%s'\n", strerror(-e));
+			goto out;
+		}
+
+		size -= r;
+		/* fwrite() returns the number of items written, never < 0 */
+		if (fwrite(buf, r, 1, stdout) != 1) {
+			e = -EIO;
+			error("cannot write: '%s'\n", strerror(-e));
+			goto out;
+		}
+	}
+
+out:
+	if (fd != -1)
+		close(fd);
+	return e;
+}
+
+static int cmd_inspect_physical_dump(int argc, char **argv)
+{
+	int ret = 0;
+	int fd;
+	char *fname;
+	u64 profile_type;
+	struct btrfs_ioctl_dev_info_args *disks = NULL;
+	struct btrfs_ioctl_fs_info_args fi_args = {0};
+	char btrfs_chunk_data[4096];
+	struct btrfs_chunk *chunk_item = (struct btrfs_chunk *)&btrfs_chunk_data;
+	u64 chunk_offset = 0;
+	struct stripe_info *stripes = NULL;
+	int stripes_count = 0;
+	int rc;
+	int copynr = 0;
+	int paritynr = -1;
+	int stripenr = -1;
+	const char *logical_arg = NULL;
+	u64 logical = 0ull;
+
+	optind = 1;
+	while (1) {
+		int c = getopt(argc, argv, "c:p:s:l:");
+
+		if (c < 0)
+			break;
+
+		switch (c) {
+		case 'c':
+			copynr = atoi(optarg);
+			break;
+		case 'p':
+			paritynr = atoi(optarg);
+			break;
+		case 's':
+			stripenr = atoi(optarg);
+			break;
+		case 'l':
+			logical_arg = optarg;
+			break;
+		default:
+			usage(cmd_inspect_physical_dump_usage);
+		}
+	}
+
+	if ((logical_arg != NULL && check_argc_exact(argc - optind, 1)) ||
+	    (check_argc_min(argc - optind, 1) || check_argc_max(argc - optind, 2)))
+		usage(cmd_inspect_physical_dump_usage);
+
+	fname = argv[optind];
+
+	check_root_or_exit();
+	check_btrfs_or_exit(fname);
+
+	fprintf(stderr, "%s: %llu\n", fname, logical);
+
+	fd = open(fname, O_RDONLY|O_DIRECT);
+	if (fd < 0) {
+		error("Can't open '%s' for reading.\n", fname);
+		ret = -errno;
+		goto out;
+	}
+
+	if (logical_arg == NULL) {
+		u64 file_offset = 0ull;
+
+		if (argc - optind == 2)
+			file_offset = strtoull(argv[optind+1], NULL, 0);
+
+		if (file_offset % 4096) {
+			error("<off> must be multiple of 4096 !");
+			return 11;
+		}
+		ret = get_file_offset(fd, file_offset, &logical);
+		if (ret > 0) {
+			error("Can't find the extent: the file is too short, or the file is stored in a leaf.\n");
+			ret = 10;
+			goto out;
+		} else if (ret < 0) {
+			int e = -ret;
+
+			error("Can't do ioctl() [errno=%d: %s]\n", e, strerror(e));
+			ret = 11;
+			goto out;
+		}
+
+		fprintf(stderr, "logical: %llu offset: %llu file: %s\n",
+			logical, file_offset, fname);
+	} else {
+		logical = strtoull(logical_arg, NULL, 0);
+		if (logical % 4096) {
+			error("<logical> must be multiple of 4096 !");
+			return 11;
+		}
+		fprintf(stderr, "logical: %llu filesystem: %s\n",
+			logical, fname);
+	}
+
+	rc = get_fs_info(fname, &fi_args, &disks);
+	if (rc < 0) {
+		error("Cannot get info for the filesystem: maybe it is not a btrfs filesystem?\n");
+		ret = 12;
+		goto out;
+	}
+
+	rc = get_chunk_offset(fd, logical,
+		chunk_item, &chunk_offset);
+	if (rc < 0) {
+		error("cannot perform the search: %s", strerror(-rc));
+		ret = 13;
+		goto out;
+	}
+	if (rc != 0) {
+		error("cannot find chunk\n");
+		ret = 14;
+		goto out;
+	}
+
+	dump_stripes(fi_args.num_devices, disks,
+		     chunk_item, chunk_offset,
+		     &stripes, &stripes_count);
+
+	profile_type = chunk_item->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	if (profile_type == 0 || profile_type & BTRFS_BLOCK_GROUP_RAID0) {
+
+		if (copynr != 0) {
+			error("-c <copynr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (stripenr != -1) {
+			error("-s <stripenr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (paritynr != -1) {
+			error("-p <paritynr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+
+		ret = dumpfile(stripes[0].dname, stripes[0].phy_start);
+
+	} else if (profile_type & BTRFS_BLOCK_GROUP_RAID1 ||
+			profile_type & BTRFS_BLOCK_GROUP_DUP ||
+			profile_type & BTRFS_BLOCK_GROUP_RAID10) {
+
+		if (stripenr != -1) {
+			error("-s <stripenr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (paritynr != -1) {
+			error("-p <paritynr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (copynr < 0 || copynr > 1) {
+			error("<copynr>=%d is not valid for profile '%s'\n",
+			      copynr, btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+
+		ret = dumpfile(stripes[copynr].dname, stripes[copynr].phy_start);
+
+	} else if (profile_type & BTRFS_BLOCK_GROUP_RAID5 ||
+		   profile_type & BTRFS_BLOCK_GROUP_RAID6) {
+
+		int maxparity = 0;
+		int stripeid = -1;
+
+		if (profile_type & BTRFS_BLOCK_GROUP_RAID6)
+			maxparity = 1;
+
+		if (copynr != 0) {
+			error("-c <copynr> is not valid for profile '%s'\n",
+			      btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (paritynr != -1 && stripenr != -1) {
+			error("You cannot pass both -p <paritynr> and -s <stripenr> for profile '%s'\n",
+				btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (paritynr < -1 || paritynr > maxparity) {
+			error("<paritynr>=%d is not valid for profile '%s'\n",
+				paritynr, btrfs_group_profile_str(profile_type));
+			ret = 16;
+			goto out;
+		}
+		if (stripenr < -1 || stripenr > (stripes_count - maxparity - 3)) {
+			error("<stripenr>=%d is not valid for profile '%s' [%d disks]\n",
+				stripenr, btrfs_group_profile_str(profile_type),
+				stripes_count);
+			ret = 16;
+			goto out;
+		}
+		if (stripenr == -1 && paritynr == -1) {
+			int i;
+
+			for (i = 0 ; i < stripes_count ; i++) {
+				if (stripes[i].type == STRIPE_INFO_RAID56_DATA) {
+					stripeid = i;
+					break;
+				}
+			}
+		} else if (paritynr != -1) {
+			int i;
+
+			for (i = 0 ; i < stripes_count ; i++) {
+				if (stripes[i].type == STRIPE_INFO_RAID56_PARITY)
+					--paritynr;
+				if (paritynr == -1) {
+					stripeid = i;
+					break;
+				}
+			}
+		} else {
+			int i;
+
+			for (i = 0 ; i < stripes_count ; i++) {
+				if (stripes[i].type == STRIPE_INFO_RAID56_OTHER)
+					--stripenr;
+				if (stripenr == -1) {
+					stripeid = i;
+					break;
+				}
+			}
+		}
+
+		assert(stripeid >= 0 && stripeid < stripes_count);
+
+		ret = dumpfile(stripes[stripeid].dname,
+			       stripes[stripeid].phy_start);
+
+	}
+
+out:
+	if (fd != -1)
+		close(fd);
+	if (disks != NULL)
+		free(disks);
+	if (stripes != NULL)
+		free(stripes);
+	return ret;
+}
+
 static const char inspect_cmd_group_info[] =
 "query various internal information";
 
@@ -1231,6 +1549,8 @@ const struct cmd_group inspect_cmd_group = {
 				cmd_inspect_tree_stats_usage, NULL, 0 },
 		{ "physical-find", cmd_inspect_physical_find,
 				cmd_inspect_physical_find_usage, NULL, 0 },
+		{ "physical-dump", cmd_inspect_physical_dump,
+				cmd_inspect_physical_dump_usage, NULL, 0 },
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.8.1



* [PATCH 4/5] Add man page for command btrfs insp physical-find
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2016-07-27 17:43 ` [PATCH 3/5] new command btrfs inspect physical-dump Goffredo Baroncelli
@ 2016-07-27 17:43 ` Goffredo Baroncelli
  2016-07-27 17:43 ` [PATCH 5/5] Add new command to man pages: btrfs insp physical-dump Goffredo Baroncelli
  2016-07-28 12:03 ` [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' David Sterba
  5 siblings, 0 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 Documentation/btrfs-inspect-internal.asciidoc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc
index 74f6dea..c132a0e 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -146,6 +146,13 @@ Print sizes and statistics of trees.
 -b::::
 Print raw numbers in bytes.
 
+*physical-find* <path> [<off>|-l <logical>]::
+(needs root privileges)
++
+Show the placement of a given file (at offset 'off', default 0) on the disks.
+If 'logical' is given, this command shows the placement of a logical address
+on the disks.
+
 EXIT STATUS
 -----------
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
-- 
2.8.1



* [PATCH 5/5] Add new command to man pages: btrfs insp physical-dump
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2016-07-27 17:43 ` [PATCH 4/5] Add man page for command btrfs insp physical-find Goffredo Baroncelli
@ 2016-07-27 17:43 ` Goffredo Baroncelli
  2016-07-28 12:03 ` [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' David Sterba
  5 siblings, 0 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-27 17:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Qu Wenruo, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 Documentation/btrfs-inspect-internal.asciidoc | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc
index c132a0e..569681d 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -153,6 +153,18 @@ Show the placement of a given file (at offset 'off', default 0) on the disks.
 If 'logical' is given, this command shows the placement of a logical address
 on the disks.
 
+*physical-dump* [-c <copynr>|-s <stripenr>|-p <paritynr>] <path> [<off>|-l <logical>]::
+(needs root privileges)
++
+Dump the disk content of a given file (at offset 'off', default 0). If 'logical'
+is passed, the content of the given logical address is dumped instead.
+For RAID1/RAID10/DUP, 'copynr' selects which copy will be dumped. For
+RAID5/RAID6, 'paritynr' specifies which parity will be dumped. For
+RAID5/RAID6, 'stripenr' specifies which stripe element will be dumped.
++
+'off' and 'logical' must be a multiple of 4096. 4096 bytes are dumped, even if
+the file is shorter.
+
 EXIT STATUS
 -----------
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
-- 
2.8.1



* Re: [PATCH 1/5] Add some helper functions
  2016-07-27 17:43 ` [PATCH 1/5] Add some helper functions Goffredo Baroncelli
@ 2016-07-28  1:03   ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2016-07-28  1:03 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: dsterba, Chris Mason, Goffredo Baroncelli



At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> Add the following functions:
> - int is_btrfs_fs(const char *path) -> returns 0 if path is a btrfs filesystem
> - void check_root_or_exit() -> checks if the user has the root capability or
>                                it exits writing an error message
> - void check_btrfs_or_exit(const char *path)
> 				checks if path is a valid btrfs filesystem,
> 				otherwise it exits
>
> Signed-off-by: Goffredo baroncelli <kreijack@inwind.it>
> ---
>  utils.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  utils.h | 14 ++++++++++++++
>  2 files changed, 55 insertions(+)
>
> diff --git a/utils.c b/utils.c
> index 578fdb0..b99706c 100644
> --- a/utils.c
> +++ b/utils.c
> @@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper)
>  	 */
>  	return (unsigned int)(jrand48(rand_seed) % upper);
>  }
> +
> +/*
> + * check if path is a btrfs filesystem
> + */
> +int is_btrfs_fs(const char *path)
> +{
> +	struct statfs stfs;
> +
> +	if (statfs(path, &stfs) != 0) {
> +		/* cannot access */
> +		return -1;
> +	}
> +
> +	if (stfs.f_type != BTRFS_SUPER_MAGIC) {
> +		/* not a btrfs filesystem */
> +		return -2;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * check if the user is root
> + */
> +void check_root_or_exit()
> +{
> +    if (geteuid() == 0)
> +        return;
> +
> +    error("You need to be root to execute this command");
> +    exit(100);
Please don't hard-code an exit value, especially one like 100.

Normally we only use 1 and 0 as exit values.

Another concern about this function is that we don't normally do such an
early check for root privilege.

In most cases we just call the privileged function, such as the tree search
ioctl, and when it fails it returns -EPERM to inform the user that they
lack the privilege.

This makes the code more extensible: if the ioctl later becomes
unprivileged, btrfs-progs doesn't need any modification.

So I think it's better to let the ioctl itself do the privilege check
rather than doing it in btrfs-progs.
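
A minimal sketch of that approach, for illustration only. The helper name
report_ioctl_result() is hypothetical and not part of btrfs-progs; it assumes
the caller passes 0 or the -errno value from a failed ioctl wrapper, and it
maps every failure to exit status 1 in line with the 0/1 convention above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of btrfs-progs): turn the 0 / -errno
 * result of an ioctl wrapper into the command's exit status. When the
 * kernel refused the call for lack of privilege, add a hint for the user.
 * Returns 0 on success and 1 on any failure.
 */
static int report_ioctl_result(int ret, const char *what)
{
	assert(ret <= 0);	/* callers pass 0 or -errno */

	if (ret == 0)
		return 0;

	if (ret == -EPERM || ret == -EACCES)
		fprintf(stderr, "ERROR: %s: %s (are you root?)\n",
			what, strerror(-ret));
	else
		fprintf(stderr, "ERROR: %s: %s\n", what, strerror(-ret));
	return 1;
}
```

With something like this the command needs no geteuid() test of its own: the
kernel decides, and the tool only reports the outcome.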
> +}
> +
> +void check_btrfs_or_exit(const char *path)
> +{
> +    if (!is_btrfs_fs(path))
> +        return;
> +
> +    error("'%s' must be a valid btrfs filesystem", path);
> +    exit(100);
> +}

Same exit value problem.

The btrfs check itself seems quite good.

What about merging it into functions like open_file_or_dir?
Most callers use such a function to open a file/dir inside a btrfs mount
point.
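
As a rough sketch of what such a merged helper could look like: the name
open_path_on_btrfs() is hypothetical, it is not the existing open_file_or_dir,
and returning -ENOTSUP for a non-btrfs path is an arbitrary choice made only
for this example. It returns an error to the caller instead of calling exit().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/vfs.h>
#include <unistd.h>

#ifndef BTRFS_SUPER_MAGIC
#define BTRFS_SUPER_MAGIC 0x9123683E
#endif

/*
 * Hypothetical helper: open 'path' read-only and verify that it lives on
 * a btrfs filesystem. Returns a file descriptor on success, -errno if the
 * path cannot be opened or inspected, and -ENOTSUP if it is not on btrfs.
 */
static int open_path_on_btrfs(const char *path)
{
	struct statfs stfs;
	int fd;
	int e;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* Check the filesystem type of the object we actually opened. */
	if (fstatfs(fd, &stfs) != 0) {
		e = errno;	/* save errno before close() can change it */
		close(fd);
		return -e;
	}
	if (stfs.f_type != BTRFS_SUPER_MAGIC) {
		close(fd);
		return -ENOTSUP;
	}
	return fd;
}
```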

Thanks,
Qu
> diff --git a/utils.h b/utils.h
> index 98bfb34..0bd6ecb 100644
> --- a/utils.h
> +++ b/utils.h
> @@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper);
>  /* Also allow setting the seed manually */
>  void init_rand_seed(u64 seed);
>
> +/* return 0 if path is a valid btrfs filesystem */
> +int is_btrfs_fs(const char *path);
> +
> +/*
> + * check if the user has the root capability, otherwise it exits printing an
> + * error message
> + */
> +void check_root_or_exit();
> +/*
> + * check if path is a valid btrfs filesystem, otherwise it exits printing an
> + * error message
> + */
> +void check_btrfs_or_exit(const char *path);
> +
>  #endif
>




* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-27 17:43 ` [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" Goffredo Baroncelli
@ 2016-07-28  1:47   ` Qu Wenruo
  2016-07-28 20:25     ` Goffredo Baroncelli
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2016-07-28  1:47 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: dsterba, Chris Mason, Goffredo Baroncelli

At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
>
> The aim of this new command is to show the physical placement on the disk
> of a file.
> Currently it handles all the profiles (single, dup, raid1/10/5/6).
>
> The syntax is simple:

Uh...
Where is the syntax?

I guess the syntax is
physical-find <filename> [<offset>]

>
> where:
>   <filename> is the file to inspect
>   <offset> is the offset of the file to inspect (default 0)

Normally <offset> is paired with <length>.
What about adding a new optional parameter <length>?
Its default value would be the length of the file.

And for the optional <offset>, would you mind making it an option,
like -o|--offset <offset> and -s|--size <size>?

To resolve a logical address directly, use -l|--logical <logical>.
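
A possible shape for that option parsing, as a sketch only. The struct
find_opts and the function parse_find_opts() are hypothetical names; the
option letters are the ones proposed above, and a size of 0 is assumed to
mean "the whole file".

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical container for the parsed options. */
struct find_opts {
	unsigned long long offset;	/* -o|--offset, default 0 */
	unsigned long long size;	/* -s|--size, 0 means "whole file" */
	unsigned long long logical;	/* -l|--logical */
	int have_logical;		/* set when -l was given */
	const char *path;		/* the single positional argument */
};

/* Returns 0 on success, -1 on a usage error. */
static int parse_find_opts(int argc, char **argv, struct find_opts *o)
{
	static const struct option long_opts[] = {
		{ "offset",  required_argument, NULL, 'o' },
		{ "size",    required_argument, NULL, 's' },
		{ "logical", required_argument, NULL, 'l' },
		{ NULL, 0, NULL, 0 }
	};
	int c;

	assert(o != NULL);
	*o = (struct find_opts){ 0 };
	optind = 1;	/* restart the scan, as the patch does */

	while ((c = getopt_long(argc, argv, "o:s:l:", long_opts, NULL)) != -1) {
		switch (c) {
		case 'o':
			o->offset = strtoull(optarg, NULL, 0);
			break;
		case 's':
			o->size = strtoull(optarg, NULL, 0);
			break;
		case 'l':
			o->logical = strtoull(optarg, NULL, 0);
			o->have_logical = 1;
			break;
		default:
			return -1;
		}
	}

	/* Exactly one positional argument: the path. */
	if (argc - optind != 1)
		return -1;
	o->path = argv[optind];
	return 0;
}
```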

>
> Below some examples:
>
> ** Single
>
> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0

So 0 is the offset inside the file.
And how long is that file extent?

>         devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR

LINEAR seems a little different, as normally we just call it SINGLE in 
btrfs.

And what about changing the output format to the following?
(This combines both fiemap style and map-logical style)
------
EXT: FILE-OFFSET LOGICAL RANGE DEVICE       DEVICE RANGE  TYPE
0:   0-128K      XXXXX-XXXXX   1:/dev/loop0 XXXXX-XXXXX   RAID1
                               2:/dev/loop1 XXXXX-XXXXX   RAID1
1:   128K-256K   XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
                               1:/dev/loop3 XXXXX-XXXXX   RAID5D2
                               1:/dev/loop4 XXXXX-XXXXX   RAID5P
                 XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
                               1:/dev/loop3 XXXXX-XXXXX   RAID5D2
                               1:/dev/loop4 XXXXX-XXXXX   RAID5P
------
Extents 0 and 1 are in different RAID profiles, which is only possible during
a convert; it is just used as an example.
Extent 1 crosses 2 RAID5 stripes, so it needs 2 logical ranges to
show them all.


Although it's quite hard to fit the above output into 80 characters per
line, it provides almost all the info we need:
1) File offset and its length
2) Logical bytenr and its length
3) Device bytenr and its length (since its length can differ from 
logical length)
4) RAID type and its role.


> $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
> adaaa
>
> ** Dup
>
> The command shows both the copies
>
> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0
>         devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
>         devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
> $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
> adaaa
>
> ** Raid1
>
> The command shows both the copies
>
> $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0
>         devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
>         devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
> $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
> adaaa
>
> ** Raid10
>
> The command shows both copies; if you set an offset in the next disk-stripe, you can see the next pair of disk-stripes
>
> $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0
>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
> adaaa
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
> mnt/out.txt: 65536
>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
> bdbbb
>
> ** Raid5
>
> Depending on the offset, you can see which disk-stripe is used.
>
> $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0
>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY

Here DATA/OTHER is a little confusing.
For a 4-disk raid5, will it be DATA/OTHER/OTHER and PARITY?

What about RAID5D1 for the first data stripe and RAID5D2 for the second?

So for a 4-disk raid5, it will be RAID5D1/D2/D3 and RAID5P (RAID5 PARITY).
It's also confusing compared to RAID6.
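
The naming scheme suggested above could be produced by a small helper along
these lines. This is only a sketch: raid56_role_name() and its parameters are
hypothetical, with data elements numbered from 1 and the parities taken as P
(index 0) and Q (index 1).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the role label of one RAID5/6 stripe element
 * into 'buf'. 'raid_level' is 5 or 6; 'data_nr' is the 1-based data index,
 * or 0 for a parity element; 'parity_nr' is 0 for P and 1 for Q and is
 * ignored for data elements.
 */
static void raid56_role_name(char *buf, size_t len, int raid_level,
			     int data_nr, int parity_nr)
{
	assert(raid_level == 5 || raid_level == 6);

	if (data_nr > 0)
		snprintf(buf, len, "RAID%dD%d", raid_level, data_nr);
	else
		snprintf(buf, len, "RAID%d%c", raid_level,
			 parity_nr == 0 ? 'P' : 'Q');
}
```

For a 4-disk raid5 this yields RAID5D1, RAID5D2, RAID5D3 and RAID5P, so each
line of output names both the profile and the role of the element.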

> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
> mnt/out.txt: 65536
>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
> $ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
> adaaa
> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
> bdbbb
> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
> 00000000: 0300 0303 03                             .....
>
> The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03"
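
The parity arithmetic in the quoted example can be checked with a few lines
of C. This is an illustrative sketch for the two-data-disk RAID5 case shown
above, not code from the patch; raid5_parity() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* XOR two equally sized data stripe elements into a parity buffer. */
static void raid5_parity(const unsigned char *d1, const unsigned char *d2,
			 unsigned char *parity, size_t len)
{
	size_t i;

	assert(d1 != NULL && d2 != NULL && parity != NULL);

	for (i = 0; i < len; i++)
		parity[i] = d1[i] ^ d2[i];
}
```

Feeding it the first five bytes of each data element, "adaaa" and "bdbbb",
gives 03 00 03 03 03, which matches the xxd output of /dev/loop2 above:
'a' ^ 'b' is 0x61 ^ 0x62 = 0x03, and 'd' ^ 'd' is 0.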
>
> ** Raid6
> $ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]
> $ sudo mount /dev/loop0 mnt/
> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
> mnt/out.txt: 0
>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

Same as RAID5.
IMHO RAID6D1/D2... and RAID6P/RAID6Q seem better to me.
>
> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
> adaaa
>
>
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  cmds-inspect.c | 587 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 587 insertions(+)
>
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index dd7b9dd..fc2e7c3 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -22,6 +22,11 @@
>  #include <errno.h>
>  #include <getopt.h>
>  #include <limits.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <linux/fs.h>
> +#include <linux/fiemap.h>
>
>  #include "kerncompat.h"
>  #include "ioctl.h"
> @@ -623,6 +628,586 @@ out:
>  	return !!ret;
>  }
>
> +
> +static const char * const cmd_inspect_physical_find_usage[] = {
> +	"btrfs inspect-internal physical-find <path> [<off>|-l <logical>]",
> +	"Show the physical placement of a file data.",
> +	"<path>     file to show",

When resolving a logical address directly, this need not be the file itself;
any file/dir inside the fs will do.

> +	"<off>      file offset to show; 0 if not specified",
> +	"<logical>  show info about a logical address instead of a file",

As mentioned above, using -o|-s|-l options seems to be a better solution.

> +	"This command requires root privileges",
> +	NULL
> +};
> +
> +#define STRIPE_INFO_LINEAR		1
> +#define STRIPE_INFO_DUP			2
> +#define STRIPE_INFO_RAID0		3
> +#define STRIPE_INFO_RAID1		4
> +#define STRIPE_INFO_RAID10		5
> +#define STRIPE_INFO_RAID56_DATA		6
> +#define STRIPE_INFO_RAID56_OTHER	7
> +#define STRIPE_INFO_RAID56_PARITY	8

As mentioned before.
And since the STRIPE_INFO_* macros are only used for outputting the string,
I'd prefer to do it in a helper function with if branches.

> +
> +static const char * const stripe_info_descr[] = {
> +	[STRIPE_INFO_LINEAR] = "LINEAR",
> +	[STRIPE_INFO_DUP] = "DUP",
> +	[STRIPE_INFO_RAID0] = "RAID0",
> +	[STRIPE_INFO_RAID1] = "RAID1",
> +	[STRIPE_INFO_RAID10] = "RAID10",
> +	[STRIPE_INFO_RAID56_DATA] = "DATA",
> +	[STRIPE_INFO_RAID56_OTHER] = "OTHER",
> +	[STRIPE_INFO_RAID56_PARITY] = "PARITY",
> +};
> +
> +struct stripe_info {
> +	u64 devid;
> +	const char *dname;
> +	u64 phy_start;
> +	int type;

IMHO "dname" contains all the needed info for the role of the stripe.
So "type" seems unnecessary here to me.

And it's better to add a "u32 phy_length" to show how long the stripe is.

> +};
> +
> +static void add_stripe_info(struct stripe_info **list, int *count,
> +	u64 devid, const char *dname, u64 phy_start, int type) {
> +
> +	if (*list == NULL)
> +		*count = 0;
> +
> +	++*count;
> +	*list = realloc(*list, sizeof(struct stripe_info) * *count);
> +	/*
> +	 * It is rude, but it should not happen for this kind of allocation...
> +	 * ... and anyway when it happens, there are more severe problems
> +	 * than this handling of "not enough memory"
> +	 */
> +	if (*list == NULL) {
> +		error("Not enough memory: abort\n");
> +		exit(100);

Same exit value problem here.

> +	}
> +
> +	(*list)[*count-1].devid = devid;
> +	(*list)[*count-1].dname = dname;
> +	(*list)[*count-1].phy_start = phy_start;
> +	(*list)[*count-1].type = type;
> +}
> +
> +static void dump_stripes(int ndisks, struct btrfs_ioctl_dev_info_args *disks,
> +			 struct btrfs_chunk *chunk, u64 logical_start,
> +			 struct stripe_info **stripes_ret, int *stripes_count) {
> +	struct btrfs_stripe *stripes;
> +
> +	stripes = &chunk->stripe;
> +
> +	if ((chunk->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0) {
> +		/* LINEAR: each chunk has (should have) only one disk */
> +		int j;
> +		char *dname = "<NOT FOUND>";
> +
> +		assert(chunk->num_stripes == 1);
> +
> +		u64 phy_start = stripes[0].offset +
> +			+logical_start;
> +		for (j = 0 ; j < ndisks ; j++) {
> +			if (stripes[0].devid == disks[j].devid) {
> +				dname = (char *)disks[j].path;
> +				break;
> +			}
> +		}
> +
> +		add_stripe_info(stripes_ret, stripes_count,
> +			stripes[0].devid, dname, phy_start,
> +			STRIPE_INFO_LINEAR);
> +	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID0) {
> +		/*
> +		 * RAID0: each chunk is composed by more disks;
> +		 * each stripe_len bytes are in a different disk:
> +		 *
> +		 *  file: ABC...NMOP....
> +		 *
> +		 *      disk1   disk2    disk3  .... disksN
> +		 *
> +		 *        A      B         C    ....    N
> +		 *        M      O         P    ....
> +		 *
> +		 */
> +		u64 disks_number = chunk->num_stripes;
> +		u64 disk_stripe_size = chunk->stripe_len;
> +		u64 stripe_capacity;
> +		u64 stripe_nr;
> +		u64 disk_stripe_start;
> +		int sidx;
> +		int j;
> +		char *dname = "<NOT FOUND>";
> +
> +		stripe_capacity = disks_number * disk_stripe_size;
> +		stripe_nr = logical_start / stripe_capacity;
> +		disk_stripe_start = logical_start % disk_stripe_size;
> +
> +		sidx = (logical_start / disk_stripe_size) % disks_number;
> +
> +		u64 phy_start = stripes[sidx].offset +
> +			stripe_nr * disk_stripe_size +
> +			disk_stripe_start;
> +
> +		for (j = 0 ; j < ndisks ; j++) {
> +			if (stripes[sidx].devid == disks[j].devid) {
> +				dname = (char *)disks[j].path;
> +				break;
> +			}
> +		}
> +
> +		add_stripe_info(stripes_ret, stripes_count,
> +			stripes[sidx].devid, dname, phy_start,
> +			STRIPE_INFO_RAID0);
> +
> +	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
> +		/*
> +		 * RAID1: each chunk is composed by more disks;
> +		 * each disk holds a full copy of the data:
> +		 *
> +		 *  file: ABC...
> +		 *
> +		 *      disk1   disk2   disk3  ....
> +		 *
> +		 *        A       A
> +		 *        B       B
> +		 *        C       C
> +		 *
> +		 */

Here btrfs raid1 is more flexible than a normal RAID1 implementation.

A better comment would be:
  Disk1   Disk2   Disk3
   A       A       B
   B       C       C

And that's the real case for a 3-disk RAID1 (with disks of the same size).
> +		int sidx;
> +
> +		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
> +			int j;
> +			char *dname = "<NOT FOUND>";
> +			u64 phy_start = stripes[sidx].offset +
> +				+logical_start;
> +
> +			for (j = 0 ; j < ndisks ; j++) {
> +				if (stripes[sidx].devid == disks[j].devid) {
> +					dname = (char *)disks[j].path;
> +					break;
> +				}
> +			}
> +			add_stripe_info(stripes_ret, stripes_count,
> +				stripes[sidx].devid, dname, phy_start,
> +				STRIPE_INFO_RAID1);
> +		}
> +
> +	} else if (chunk->type & BTRFS_BLOCK_GROUP_DUP) {
> +		/*
> +		 * DUP: each chunk has 'num_stripes' disk_stripe. Each
> +		 * disk_stripe has its own copy of data
> +		 *
> +		 *  file: ABCD....
> +		 *
> +		 *      disk1   disk2   disk3
> +		 *
> +		 *        A
> +		 *        B
> +		 *        C
> +		 *      [...]
> +		 *        A
> +		 *        B
> +		 *        C
> +		 *
> +		 *
> +		 * NOTE: the difference between DUP and RAID1 is that
> +		 * in RAID1 each disk_stripe is in a different disk, in DUP
> +		 * each disk chunk is in the same disk
> +		 */
> +		int sidx;
> +
> +		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
> +			int j;
> +			char *dname = "<NOT FOUND>";
> +			u64 phy_start = stripes[sidx].offset +
> +				+logical_start;
> +
> +			for (j = 0 ; j < ndisks ; j++) {
> +				if (stripes[sidx].devid == disks[j].devid) {
> +					dname = (char *)disks[j].path;
> +					break;
> +				}
> +			}
> +
> +			add_stripe_info(stripes_ret, stripes_count,
> +				stripes[sidx].devid, dname, phy_start,
> +				STRIPE_INFO_DUP);
> +		}
> +	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID10) {
> +		/*
> +		 * RAID10: each chunk is composed by more disks;
> +		 * each stripe_len bytes are in a different disk:
> +		 *
> +		 *  file: ABCD....
> +		 *
> +		 *      disk1   disk2   disk3   disk4
> +		 *
> +		 *        A      A         B      B
> +		 *        C      C         D      D
> +		 *
> +		 *
> +		 */
> +		int i;
> +		u64 disks_number = chunk->num_stripes;
> +		u64 disk_stripe_size = chunk->stripe_len;
> +		u64 stripe_capacity;
> +		u64 stripe_nr;
> +		u64 stripe_start;
> +		u64 disk_stripe_start;
> +
> +		stripe_capacity = disks_number * disk_stripe_size / chunk->sub_stripes;
> +		stripe_nr = logical_start / stripe_capacity;
> +		stripe_start = logical_start % stripe_capacity;
> +		disk_stripe_start = logical_start % disk_stripe_size;
> +
> +		for (i = 0; i < chunk->sub_stripes; i++) {
> +			int j;
> +			char *dname = "<NOT FOUND>";
> +			int sidx = (i +
> +				stripe_start/disk_stripe_size*chunk->sub_stripes) %
> +				disks_number;
> +
> +			u64 phy_start = stripes[sidx].offset +
> +				+stripe_nr*disk_stripe_size + disk_stripe_start;
> +
> +			for (j = 0 ; j < ndisks ; j++) {
> +				if (stripes[sidx].devid == disks[j].devid) {
> +					dname = (char *)disks[j].path;
> +					break;
> +				}
> +			}
> +
> +			add_stripe_info(stripes_ret, stripes_count,
> +				stripes[sidx].devid, dname, phy_start,
> +				STRIPE_INFO_RAID10);
> +
> +		}
> +	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID5 ||
> +			chunk->type & BTRFS_BLOCK_GROUP_RAID6) {
> +		/*
> +		 * RAID5: each chunk is spread on a different disk; however one
> +		 * disk is used for parity
> +		 *
> +		 *  file: ABCDEFGHIJK....
> +		 *
> +		 *      disk1  disk2  disk3  disk4  disk5
> +		 *
> +		 *        A      B      C      D      P
> +		 *        P      D      E      F      G
> +		 *        H      P      I      J      K
> +		 *
> +		 *   Note: P == parity
> +		 *
> +		 * RAID6: each chunk is spread on a different disk; however two
> +		 * disks are used for parity
> +		 *
> +		 *  file: ABCDEFGHI...
> +		 *
> +		 *      disk1  disk2  disk3  disk4  disk5
> +		 *
> +		 *        A      B      C      P      Q
> +		 *        Q      D      E      F      P
> +		 *        P      Q      G      H      I
> +		 *
> +		 *   Note: P,Q == parity
> +		 *
> +		 */
> +		int parities_nr = 1;
> +		u64 disks_number = chunk->num_stripes;
> +		u64 disk_stripe_size = chunk->stripe_len;
> +		u64 stripe_capacity;
> +		u64 stripe_nr;
> +		u64 stripe_start;
> +		u64 pos = 0;
> +		u64 disk_stripe_start;
> +		int sidx;
> +
> +		if (chunk->type & BTRFS_BLOCK_GROUP_RAID6)
> +			parities_nr = 2;
> +
> +		stripe_capacity = (disks_number - parities_nr) *
> +						disk_stripe_size;
> +		stripe_nr = logical_start / stripe_capacity;
> +		stripe_start = logical_start % stripe_capacity;
> +		disk_stripe_start = logical_start % disk_stripe_size;
> +
> +		for (sidx = 0; sidx < disks_number ; sidx++) {
> +			int j;
> +			char *dname = "<NOT FOUND>";
> +			u64 stripe_index = (sidx + stripe_nr) % disks_number;
> +			u64 phy_start = stripes[stripe_index].offset + /* chunk start */
> +				+ stripe_nr*disk_stripe_size +  /* stripe start */
> +				+ disk_stripe_start;
> +
> +			for (j = 0 ; j < ndisks ; j++)
> +				if (stripes[stripe_index].devid == disks[j].devid) {
> +					dname = (char *)disks[j].path;
> +					break;
> +				}
> +
> +			if (sidx >= (disks_number - parities_nr)) {
> +				add_stripe_info(stripes_ret, stripes_count,
> +					stripes[stripe_index].devid, dname, phy_start,
> +					STRIPE_INFO_RAID56_PARITY);
> +				continue;
> +			}
> +
> +			if (stripe_start >= pos && stripe_start < (pos+disk_stripe_size)) {
> +				add_stripe_info(stripes_ret, stripes_count,
> +					stripes[stripe_index].devid, dname, phy_start,
> +					STRIPE_INFO_RAID56_DATA);
> +			} else {
> +				add_stripe_info(stripes_ret, stripes_count,
> +					stripes[stripe_index].devid, dname, phy_start,
> +					STRIPE_INFO_RAID56_OTHER);
> +			}
> +
> +			pos += disk_stripe_size;
> +		}
> +		assert(pos == stripe_capacity);
> +	} else {
> +		error("Unknown chunk type = 0x%016llx\n", chunk->type);
> +		return;
> +	}
> +
> +}
> +
> +static int get_chunk_offset(int fd, u64 logical_start,
> +	struct btrfs_chunk *chunk_ret, u64 *off_ret) {
> +
> +	struct btrfs_ioctl_search_args args;
> +	struct btrfs_ioctl_search_key *sk = &args.key;
> +	struct btrfs_ioctl_search_header sh;
> +	unsigned long off = 0;
> +	int i;
> +
> +	memset(&args, 0, sizeof(args));
> +	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
> +	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
> +	sk->max_offset = (u64)-1;
> +	sk->min_offset = 0;
> +	sk->max_transid = (u64)-1;
> +
> +	while (1) {
> +		int ret;
> +
> +		sk->nr_items = 1;
> +		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
> +		if (ret < 0)
> +			return -errno;
> +
> +		if (sk->nr_items == 0)
> +			break;
> +
> +		off = 0;
> +		for (i = 0; i < sk->nr_items; i++) {
> +			struct btrfs_chunk *item;
> +
> +			memcpy(&sh, args.buf + off, sizeof(sh));
> +			off += sizeof(sh);
> +			item = (struct btrfs_chunk *)(args.buf + off);
> +			off += sh.len;
> +
> +			if (logical_start >= sh.offset &&
> +			    logical_start < sh.offset+item->length) {
> +				memcpy(chunk_ret, item, sh.len);
> +				*off_ret = logical_start-sh.offset;
> +				return 0;
> +			}
> +
> +			sk->min_objectid = sh.objectid;
> +			sk->min_type = sh.type;
> +			sk->min_offset = sh.offset;
> +		}
> +
> +		if (sk->min_offset < (u64)-1)
> +			sk->min_offset++;
> +		else
> +			break;
> +	}
> +
> +	return 1; /* not found */
> +}
> +
> +/*
> + * Inline extents are skipped because they do not take data space,
> + * delalloc and unknown are skipped because we do not know how much
> + * space they will use yet.
> + */
> +#define SKIP_FLAGS	(FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC| \
> +			 FIEMAP_EXTENT_DATA_INLINE)
> +
> +static int get_file_offset(int fd, u64 file_offset, u64 *logical)
> +{
> +	char buf[16384];
> +	struct fiemap *fiemap = (struct fiemap *)buf;
> +	struct fiemap_extent *fm_ext;
> +	const int count = (sizeof(buf) - sizeof(*fiemap)) /
> +					sizeof(struct fiemap_extent);
> +	int last = 0;
> +
> +
> +	memset(fiemap, 0, sizeof(struct fiemap));
> +
> +	do {
> +
> +		int rc;
> +		int j;
> +
> +		fiemap->fm_length = ~0ULL;
> +		fiemap->fm_extent_count = count;
> +		fiemap->fm_flags = FIEMAP_FLAG_SYNC;
> +		rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
> +		if (rc < 0)
> +			return -errno;
> +
> +		for (j = 0; j < fiemap->fm_mapped_extents; j++) {
> +			u32 flags;
> +
> +			fm_ext = &fiemap->fm_extents[j];
> +			flags = fm_ext->fe_flags;
> +
> +			fiemap->fm_start = (fm_ext->fe_logical +
> +					fm_ext->fe_length);
> +
> +			if (flags & FIEMAP_EXTENT_LAST)
> +				last = 1;
> +
> +			if (flags & SKIP_FLAGS)
> +				continue;
> +
> +			if (file_offset >= fm_ext->fe_logical +
> +			    fm_ext->fe_length)
> +				continue;
> +
> +			*logical = fm_ext->fe_physical + file_offset -
> +				   fm_ext->fe_logical;
> +			return 0;
> +		}
> +	} while (last == 0);
> +
> +	return 1;
> +}
> +static int cmd_inspect_physical_find(int argc, char **argv)
> +{
> +	int ret = 0;
> +	int fd = -1;
> +	char *fname;
> +	struct btrfs_ioctl_dev_info_args *disks = NULL;
> +	struct btrfs_ioctl_fs_info_args fi_args = {0};
> +	char btrfs_chunk_data[4096];
> +	struct btrfs_chunk *chunk_item = (struct btrfs_chunk *)&btrfs_chunk_data;
> +	u64 chunk_offset = 0;
> +	struct stripe_info *stripes = NULL;
> +	int stripes_count = 0;
> +	int i;
> +	int rc;
> +	const char *logical_arg = NULL;
> +	u64 logical = 0ull;
> +
> +
> +	optind = 1;
> +	while (1) {
> +		int c = getopt(argc, argv, "l:");
> +
> +		if (c < 0)
> +			break;
> +
> +		switch (c) {
> +		case 'l':
> +			logical_arg = optarg;
> +			break;
> +		default:
> +			usage(cmd_inspect_physical_find_usage);
> +		}
> +	}
> +
> +	if ((logical_arg != NULL && check_argc_exact(argc - optind, 1)) ||
> +	    (check_argc_min(argc - optind, 1) || check_argc_max(argc - optind, 2)))
> +		usage(cmd_inspect_physical_find_usage);
> +
> +	fname = argv[optind];
> +
> +	check_root_or_exit();
> +	check_btrfs_or_exit(fname);

If we call get_fs_info(), is it really needed to check btrfs early?

Thanks,
Qu
> +
> +	fd = open(fname, O_RDONLY);
> +	if (fd < 0) {
> +		error("Can't open '%s' for reading\n", fname);
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (logical_arg == NULL) {
> +		u64 file_offset = 0ull;
> +
> +		if (argc - optind == 2)
> +			file_offset = strtoull(argv[optind+1], NULL, 0);
> +		ret = get_file_offset(fd, file_offset, &logical);
> +		if (ret > 0) {
> +			error("Can't find the extent: the file is too short, or the file is stored in a leaf.\n");
> +			ret = 10;
> +			goto out;
> +		} else if (ret < 0) {
> +			int e = -ret;
> +
> +			error("Can't do ioctl() [errno=%d: %s]\n", e, strerror(e));
> +			ret = 11;
> +			goto out;
> +		}
> +
> +		printf("logical: %llu offset: %llu file: %s\n", logical,
> +		       file_offset, fname);
> +	} else {
> +		logical = strtoull(logical_arg, NULL, 0);
> +		printf("logical: %llu filesystem: %s\n", logical, fname);
> +	}
> +
> +	rc = get_fs_info(fname, &fi_args, &disks);
> +	if (rc < 0) {
> +		error("Cannot get info for the filesystem: may be it is not a btrfs filesystem ?\n");
> +		ret = 12;
> +		goto out;
> +	}
> +
> +	rc = get_chunk_offset(fd,
> +		logical,
> +		chunk_item, &chunk_offset);
> +	if (rc < 0) {
> +		error("cannot perform the search: %s", strerror(-rc));
> +		ret = 13;
> +		goto out;
> +	}
> +	if (rc != 0) {
> +		error("cannot find chunk\n");
> +		ret = 14;
> +		goto out;
> +	}
> +
> +	dump_stripes(fi_args.num_devices, disks,
> +		     chunk_item, chunk_offset,
> +		     &stripes, &stripes_count);
> +
> +	for (i = 0 ; i < stripes_count ; i++) {
> +		printf("devid: %llu dev_name: %s offset: %llu type: %s\n",
> +			stripes[i].devid, stripes[i].dname,
> +			stripes[i].phy_start,
> +			stripe_info_descr[stripes[i].type]);
> +	}
> +
> +out:
> +	if (fd != -1)
> +		close(fd);
> +	if (disks != NULL)
> +		free(disks);
> +	if (stripes != NULL)
> +		free(stripes);
> +	return ret;
> +}
> +
>  static const char inspect_cmd_group_info[] =
>  "query various internal information";
>
> @@ -644,6 +1229,8 @@ const struct cmd_group inspect_cmd_group = {
>  				cmd_inspect_dump_super_usage, NULL, 0 },
>  		{ "tree-stats", cmd_inspect_tree_stats,
>  				cmd_inspect_tree_stats_usage, NULL, 0 },
> +		{ "physical-find", cmd_inspect_physical_find,
> +				cmd_inspect_physical_find_usage, NULL, 0 },
>  		NULL_CMD_STRUCT
>  	}
>  };
>




* Re: [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'
  2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2016-07-27 17:43 ` [PATCH 5/5] Add new command to man pages: btrfs insp physical-dump Goffredo Baroncelli
@ 2016-07-28 12:03 ` David Sterba
  5 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2016-07-28 12:03 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs, dsterba, Chris Mason, Qu Wenruo

On Wed, Jul 27, 2016 at 07:43:13PM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> the following patches add two new commands: 
> 1) btrfs inspect-internal physical-find
> 2) btrfs inspect-internal physical-dump
> 
> The aim of these two new commands is to locate (1) and dump (2) the stripe elements
> stored on the disks. I developed these two new command to simplify the
> debugging of some RAID5 bugs (but this is another discussion).
[...]

Overall it looks good to me. I'll consider the series for the 4.8 release,
but I won't have time to review it closely in the following weeks. Qu's
comments go in the same direction as mine would, so please address them in
the next patch iteration. Thanks.


* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-28  1:47   ` Qu Wenruo
@ 2016-07-28 20:25     ` Goffredo Baroncelli
  2016-07-29  1:34       ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-28 20:25 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, Chris Mason, Goffredo Baroncelli

Hi Qu,

On 2016-07-28 03:47, Qu Wenruo wrote:
> At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> The aim of this new command is to show the physical placement on the disk
>> of a file.
>> Currently it handles all the profiles (single, dup, raid1/10/5/6).
>>
>> The syntax is simple:
> 
> Uh...
> Where is the synatx?

:-)

The syntax is:

btrfs inspect-internal physical-find <filename> [-l <logical>|<offset>]

> 
> I guess the synatx is
> physical-find <filename> [<offset>]
> 
>>
>> where:
>>   <filename> is the file to inspect
>>   <offset> is the offset of the file to inspect (default 0)
> 
> Normally <offset> is paired with <length>.
> What about add a new optional parameter <length>?

See my next comment

> Its default value would be the length of the file.
> 
> And for the optional <offset>, would you mind to make it as an option?
> like -o|--offset <offset> and -s|--size <size>?
> 
> For resolve logical directly, then -l|--logical <logical>.
> 
>>
>> Below some examples:
>>
>> ** Single
>>
>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
> 
> So 0 is the offset inside the file.
> And how long that file extent is?
> 
>>         devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR
> 
> LINEAR seems a little different, as normally we just call it SINGLE in btrfs.

Right

> 
> And what about changing the output format to the following?
> (This combines both fiemap style and map-logical style)
> ------
> EXT: FILE-OFFSET LOGICAL RANGE DEVICE       DEVICE RANG   TYPE
> 0:   0-128K      XXXXX-XXXXX   1:/dev/loop0 XXXXX-XXXXX   RAID1
>                                2:/dev/loop1 XXXXX-XXXXX   RAID1
> 1:   128K-256K   XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
>                                1:/dev/loop3 XXXXX-XXXXX   RAID5D2
>                                1:/dev/loop4 XXXXX-XXXXX   RAID5P
>                  XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
>                                1:/dev/loop3 XXXXX-XXXXX   RAID5D2
>                                1:/dev/loop4 XXXXX-XXXXX   RAID5P
> ------
> Extent 0 and 1 are in different raid profile, it's only possible during convert, just used as an exmple
> And Extent 1 are crossing 2 RAID5 stripes, so needs 2 logical range to show them all.

This is "quite clear" from a human point of view, but it is a nightmare for a script to parse. And what is missing is
something like "RAID5U" (U == unrelated) for an element that belongs to the stripe but not to the file.

 
> 
> Although it's quite hard to put the above output into 80 characters per line, it provides almost every info we need:
> 1) File offset and its length
> 2) Logical bytenr and its length
> 3) Device bytenr and its length (since its length can differ from logical length)
> 4) RAID type and its role.

I am not against your proposal; however, I have to point out that the goal of these commands was not to *traverse* the file, but only to find the physical location of a file offset. My use case was to simulate the corruption of a RAID5 stripe element: for me it was sufficient to know the page position.

If you want this information to automate a test, I think that the range concept is more of a problem than a help.

I suggest adding a third command (btrfs insp ranges ?) which shows what you are looking for.
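
For the corruption use case, a quick sanity check on a RAID5 stripe with two
data elements is to recompute the parity by hand and compare it with the
PARITY element. The following is only a sketch and not code from the patch:
xor_parity() is a hypothetical helper, and the buffers stand in for bytes
read from the devices (for example with dd).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of btrfs-progs): compute the RAID5
 * parity of two data stripe elements, byte by byte.
 */
static void xor_parity(const uint8_t *d1, const uint8_t *d2,
		       uint8_t *parity, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] = d1[i] ^ d2[i];
}
```

Fed with the first five bytes of the two data elements from the RAID5
example ("adaaa" and "bdbbb"), it yields 03 00 03 03 03, which matches the
xxd output of the PARITY element.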

> 
> 
>> $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
>> adaaa
>>
>> ** Dup
>>
>> The command shows both the copies
>>
>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
>>         devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
>>         devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
>> $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
>> adaaa
>>
>> ** Raid1
>>
>> The command shows both the copies
>>
>> $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
>>         devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
>>         devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
>> $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
>> adaaa
>>
>> ** Raid10
>>
>> The command show both the copies; if you set an offset to the next disk-stripe, you can see the next pair of disk-stripe
>>
>> $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
>>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
>> adaaa
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
>> mnt/out.txt: 65536
>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
>> bdbbb
>>
>> ** Raid5
>>
>> Depending by the offset, you can see which disk-stripe is used.
>>
>> $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
> 
> Here DATA/OTHER is a little confusing.
> For 4 disks raid5, will it be DATA/OTHER/OTHER and PARITY?
> 
> What about RAID5D1 for the first data stripe and RAID5D2 for the second?

And what about a data stripe which is not related to the file we are examining?




> 
> So for 4 disks raid5, it will be RAID5D1/D2/D3 and RAID5P (RAID5 PARITY)
> And it's also confusing compared to RAID6.
> 
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536mnt/out.txt: 65536
>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
>> $ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
>> adaaa
>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
>> bdbbb
>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
>> 00000000: 0300 0303 03                             .....
>>
>> The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03
>>
>> ** Raid6
>> $ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C
>> $ sudo mount /dev/loop0 mnt/
>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY
> 
> Same like RAID5.
> IMHO RAID6D1/D2... and RAID6P RAID6Q seems better for me.
>>
>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
>> adaaa
>>
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>  cmds-inspect.c | 587 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 587 insertions(+)
>>
>> diff --git a/cmds-inspect.c b/cmds-inspect.c
>> index dd7b9dd..fc2e7c3 100644
>> --- a/cmds-inspect.c
>> +++ b/cmds-inspect.c
>> @@ -22,6 +22,11 @@
>>  #include <errno.h>
>>  #include <getopt.h>
>>  #include <limits.h>
>> +#include <sys/types.h>
>> +#include <sys/stat.h>
>> +#include <fcntl.h>
>> +#include <linux/fs.h>
>> +#include <linux/fiemap.h>
>>
>>  #include "kerncompat.h"
>>  #include "ioctl.h"
>> @@ -623,6 +628,586 @@ out:
>>      return !!ret;
>>  }
>>
>> +
>> +static const char * const cmd_inspect_physical_find_usage[] = {
>> +    "btrfs inspect-internal physical-find <path> [<off>|-l <logical>]",
>> +    "Show the physical placement of a file data.",
>> +    "<path>     file to show",
> 
> For resolving logical directly, not the file but any file/dir inside the fs though.

Ok
> 
>> +    "<off>      file offset to show; 0 if not specified",
>> +    "<logical>  show info about a logical address instead of a file",
> 
> Mentioned above, use -o|-s|-l options seems to be a better solution.

ok
> 
>> +    "This command requires root privileges",
>> +    NULL
>> +};
>> +
>> +#define STRIPE_INFO_LINEAR        1
>> +#define STRIPE_INFO_DUP            2
>> +#define STRIPE_INFO_RAID0        3
>> +#define STRIPE_INFO_RAID1        4
>> +#define STRIPE_INFO_RAID10        5
>> +#define STRIPE_INFO_RAID56_DATA        6
>> +#define STRIPE_INFO_RAID56_OTHER    7
>> +#define STRIPE_INFO_RAID56_PARITY    8
> 
> Mentioned before.
> And since the STRIPE_INFO_* macro is only used in outputting the string, I prefer to do it in a helper function with if branches.

ok

> 
>> +
>> +static const char * const stripe_info_descr[] = {
>> +    [STRIPE_INFO_LINEAR] = "LINEAR",
>> +    [STRIPE_INFO_DUP] = "DUP",
>> +    [STRIPE_INFO_RAID0] = "RAID0",
>> +    [STRIPE_INFO_RAID1] = "RAID1",
>> +    [STRIPE_INFO_RAID10] = "RAID10",
>> +    [STRIPE_INFO_RAID56_DATA] = "DATA",
>> +    [STRIPE_INFO_RAID56_OTHER] = "OTHER",
>> +    [STRIPE_INFO_RAID56_PARITY] = "PARITY",
>> +};
>> +
>> +struct stripe_info {
>> +    u64 devid;
>> +    const char *dname;
>> +    u64 phy_start;
>> +    int type;
> 
> IMHO "dname" contains all the neede info for the role of the stripe.
> So "type" is useless here for me though.

Sorry, I can't understand you: dname is the device name; the role of the stripe
depends on several factors, so I also added the type field.

> 
> And it's better to add a "u32 phy_length" to show how long the stripe is.
> 
>> +};
>> +
>> +static void add_stripe_info(struct stripe_info **list, int *count,
>> +    u64 devid, const char *dname, u64 phy_start, int type) {
>> +
>> +    if (*list == NULL)
>> +        *count = 0;
>> +
>> +    ++*count;
>> +    *list = realloc(*list, sizeof(struct stripe_info) * *count);
>> +    /*
>> +     * It is rude, but it should not happen for this kind of allocation...
>> +     * ... and anyway when it happens, there are more severe problems
>> +     * that this handling of "not enough memory"
>> +     */
>> +    if (*list == NULL) {
>> +        error("Not nough memory: abort\n");
>> +        exit(100);
> 
> Same exit value problem here.

ok
[...]


>> +
>> +    } else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
>> +        /*
>> +         * RAID0: each chunk is composed by more disks;
>> +         * each stripe_len bytes are in a different disk:
>> +         *
>> +         *  file: ABC...
>> +         *
>> +         *      disk1   disk2   disk3  ....
>> +         *
>> +         *        A       A
>> +         *        B       B
>> +         *        C       C
>> +         *
>> +         */
> 
> Here btrfs raid1 is more flex than normal RAID1 implement.
> 
> Better comment would be:
>  Disk1   Disk2   Disk3
>   A       A       B
>   B       C       C

ok

> 
> And that's the real case for 3 disks RAID1 (for same disk size).


[...]
>> +static int cmd_inspect_physical_find(int argc, char **argv)
>> +{
>> +    int ret = 0;
[...]
>> +
>> +    check_root_or_exit();
>> +    check_btrfs_or_exit(fname);
> 
> If we call get_fs_info(), is it really needed to check btrfs early?

The two above are the main reasons of failure of this command, so I
preferred to add a clear check for each property we require.
I think that a statement like:
	"You need to be root to execute this command"
is clearer than a generic EPERM: the user could think that it is
sufficient to change the permissions of the file.
	

> 
> Thanks,
> Qu

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-28 20:25     ` Goffredo Baroncelli
@ 2016-07-29  1:34       ` Qu Wenruo
  2016-07-29  5:08         ` Goffredo Baroncelli
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2016-07-29  1:34 UTC (permalink / raw)
  To: kreijack, linux-btrfs; +Cc: dsterba, Chris Mason

Hi, Goffredo,

Sorry, I forgot to mention that even though btrfs-map-logical is an offline
tool, it can still handle a mounted fs too.

Although it's also true that it still lacks the needed RAID flags and
stripe info.

At 07/29/2016 04:25 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> On 2016-07-28 03:47, Qu Wenruo wrote:
>> At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> The aim of this new command is to show the physical placement on the disk
>>> of a file.
>>> Currently it handles all the profiles (single, dup, raid1/10/5/6).
>>>
>>> The syntax is simple:
>>
>> Uh...
>> Where is the synatx?
>
> :-)
>
> The syntax is:
>
> btrfs inspect-internal physical-find <filename> [-l <logical>|<offset>]
>
>>
>> I guess the synatx is
>> physical-find <filename> [<offset>]
>>
>>>
>>> where:
>>>   <filename> is the file to inspect
>>>   <offset> is the offset of the file to inspect (default 0)
>>
>> Normally <offset> is paired with <length>.
>> What about add a new optional parameter <length>?
>
> See my next comment
>
>> Its default value would be the length of the file.
>>
>> And for the optional <offset>, would you mind to make it as an option?
>> like -o|--offset <offset> and -s|--size <size>?
>>
>> For resolve logical directly, then -l|--logical <logical>.
>>
>>>
>>> Below some examples:
>>>
>>> ** Single
>>>
>>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>
>> So 0 is the offset inside the file.
>> And how long that file extent is?
>>
>>>         devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR
>>
>> LINEAR seems a little different, as normally we just call it SINGLE in btrfs.
>
> Right
>
>>
>> And what about changing the output format to the following?
>> (This combines both fiemap style and map-logical style)
>> ------
>> EXT: FILE-OFFSET LOGICAL RANGE DEVICE       DEVICE RANG   TYPE
>> 0:   0-128K      XXXXX-XXXXX   1:/dev/loop0 XXXXX-XXXXX   RAID1
>>                                2:/dev/loop1 XXXXX-XXXXX   RAID1
>> 1:   128K-256K   XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
>>                                1:/dev/loop3 XXXXX-XXXXX   RAID5D2
>>                                1:/dev/loop4 XXXXX-XXXXX   RAID5P
>>                  XXXXX-XXXXX   1:/dev/loop2 XXXXX-XXXXX   RAID5D1
>>                                1:/dev/loop3 XXXXX-XXXXX   RAID5D2
>>                                1:/dev/loop4 XXXXX-XXXXX   RAID5P
>> ------
>> Extent 0 and 1 are in different raid profile, it's only possible during convert, just used as an exmple
>> And Extent 1 are crossing 2 RAID5 stripes, so needs 2 logical range to show them all.
>
> This is "quite clear" from an human point of view. But is a nightmare for a script to parse.. And what is missing is
> something like "RAID5U" (U==unrelated) for element of the stripe but not of the file

Errr, right. It's not script-friendly at all.
But that's not hard to fix.
------
EXT: FILE-OFFSET   LOGICAL RANGE DEVICE       DEVICE RANGE  TYPE
0:   0-131071      XXXXX-XXXXX   1:/dev/loop0 XXXXX-XXXXX   RAID1
0:   0-131071      XXXXX-XXXXX   2:/dev/loop1 XXXXX-XXXXX   RAID1
1:   131072-196608 XXXXX-XXXXX   3:/dev/loop2 XXXXX-XXXXX   RAID5D1
1:   131072-196608 XXXXX-XXXXX   4:/dev/loop3 XXXXX-XXXXX   RAID5D2
1:   131072-196608 XXXXX-XXXXX   5:/dev/loop4 XXXXX-XXXXX   RAID5P
1:   131072-196608 XXXXX-XXXXX   3:/dev/loop2 XXXXX-XXXXX   RAID5D1
1:   131072-196608 XXXXX-XXXXX   4:/dev/loop3 XXXXX-XXXXX   RAID5D2
1:   131072-196608 XXXXX-XXXXX   5:/dev/loop4 XXXXX-XXXXX   RAID5P
------

Just prepend the extent number, file offset and logical range to every line.
And for an unrelated data stripe, adding a "U" suffix would be good enough.
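
A minimal sketch of emitting one such self-contained row per stripe element
could look like the following. Nothing here is from the patch: struct
map_row and format_map_row() are hypothetical names, and the field set just
mirrors the columns of the table above.

```c
#include <stdio.h>

/* Hypothetical row layout, following the columns of the proposed table. */
struct map_row {
	int ext;				/* extent number */
	unsigned long long file_start, file_end;
	unsigned long long logical_start, logical_end;
	unsigned long long devid;
	const char *dev_name;
	unsigned long long dev_start, dev_end;
	const char *role;			/* e.g. "RAID1", "RAID5D1" */
};

/*
 * Format one row with every field repeated, separated by single spaces,
 * so that each line can be parsed on its own.
 * Returns what snprintf() returns.
 */
static int format_map_row(char *buf, size_t size, const struct map_row *r)
{
	return snprintf(buf, size, "%d: %llu-%llu %llu-%llu %llu:%s %llu-%llu %s",
			r->ext, r->file_start, r->file_end,
			r->logical_start, r->logical_end,
			r->devid, r->dev_name,
			r->dev_start, r->dev_end, r->role);
}
```

Since no field is left blank, a script can split each line on whitespace
without having to remember values from the previous line.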

>
>
>>
>> Although it's quite hard to put the above output into 80 characters per line, it provides almost every info we need:
>> 1) File offset and its length
>> 2) Logical bytenr and its length
>> 3) Device bytenr and its length (since its length can differ from logical length)
>> 4) RAID type and its role.
>
> I am not against about your proposal; however I have to point out that the goal of these command was not to *traverse* the file, but only to found the physical location of a file offset. My use case was to simulate a corruption of a raid5 stripe elements: for me it was sufficient to know the page position.

For the corruption case, the best practice would be to extend the
btrfs-corrupt-block command.

And for your original proposal, to locate the page/sector containing the
bytenr/offset, the returned value should always be aligned to
sectorsize. (And we need to state it clearly in both the man page and the
help string.)

Unfortunately, that's not the case in the current implementation.
(And don't forget future subpage sector size, so in that case, we need
to check sectorsize first.)

For example, if the user passes an unaligned logical address, physical-find
will return an unaligned device offset.

If the goal is only to locate the stripe/sector, returning an aligned number
seems more reasonable.
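
As a rough sketch of what I mean by aligning (not code from the patch;
sector_align_down() is a hypothetical helper, and the sectorsize would have
to be read from the filesystem rather than assumed):

```c
#include <stdint.h>

/*
 * Hypothetical helper: round a byte offset down to the start of the
 * sector that contains it.
 * Assumes sectorsize is a non-zero power of two (4096 is a common value).
 */
static uint64_t sector_align_down(uint64_t bytenr, uint32_t sectorsize)
{
	return bytenr & ~((uint64_t)sectorsize - 1);
}
```

With a 4096-byte sectorsize, an unaligned offset such as 61931520 + 123
would be reported as 61931520, the start of the containing sector.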

IMHO if we only want a simple tool, then make it clear that it's just a
simple tool, and document its limitations so that it stays simple and
rejects any complex/unexpected input.

Or, make it handle unexpected and complex input well.


BTW, a long time ago, btrfs-map-logical was in the same situation: just
a simple tool to do offline logical->device offset mapping.
But since it provides offset/length pair options, it could produce
wrong or useless results for unaligned input.
And we spent some time improving it.

So I hope we can avoid the problem that has already happened in
map-logical.



>
> If you want these information to automate a test, I think that the range concept is more a problem than an help.
>
> I suggest to add a third command (btrfs insp ranges ?) which show what are you looking.
>
>>
>>
>>> $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
>>> adaaa
>>>
>>> ** Dup
>>>
>>> The command shows both the copies
>>>
>>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>>         devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
>>>         devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
>>> $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
>>> adaaa
>>>
>>> ** Raid1
>>>
>>> The command shows both the copies
>>>
>>> $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
>>>         devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
>>>         devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
>>> $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
>>> adaaa
>>>
>>> ** Raid10
>>>
>>> The command show both the copies; if you set an offset to the next disk-stripe, you can see the next pair of disk-stripe
>>>
>>> $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
>>>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
>>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
>>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
>>> adaaa
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
>>> mnt/out.txt: 65536
>>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
>>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
>>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
>>> bdbbb
>>>
>>> ** Raid5
>>>
>>> Depending by the offset, you can see which disk-stripe is used.
>>>
>>> $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
>>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
>>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
>>
>> Here DATA/OTHER is a little confusing.
>> For 4 disks raid5, will it be DATA/OTHER/OTHER and PARITY?
>>
>> What about RAID5D1 for the first data stripe and RAID5D2 for the second?
>
> And what about a data-stripe which is not related to the file which we are examining ?

Add a "U" suffix if you like.
Or some other character like "*"?

BTW, if we follow this syntax, it's also important to document
what such a suffix means.
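
To make the naming concrete, here is a small sketch of how such labels could
be built. It is only an illustration of the scheme discussed here, not code
from the patch: stripe_role_label() and its parameters are hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build a role label such as "RAID5D1", "RAID5D2U",
 * "RAID5P" or "RAID6Q" for one stripe element.
 *
 * raid_level:   5 or 6
 * data_index:   1-based index of a data element, or 0 for a parity element
 * parity_index: 0 for P, anything else for Q (only used when
 *               data_index == 0)
 * unrelated:    non-zero if the data element does not hold the requested
 *               file range, which adds the "U" suffix
 *
 * Returns what snprintf() returns.
 */
static int stripe_role_label(char *buf, size_t size, int raid_level,
			     int data_index, int parity_index, int unrelated)
{
	if (data_index == 0)
		return snprintf(buf, size, "RAID%d%c", raid_level,
				parity_index == 0 ? 'P' : 'Q');

	return snprintf(buf, size, "RAID%dD%d%s", raid_level, data_index,
			unrelated ? "U" : "");
}
```

For a 3-disk RAID5 where the requested offset lives in the first data
element, the three elements would be labelled RAID5D1, RAID5D2U and RAID5P.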

>
>
>
>
>>
>> So for 4 disks raid5, it will be RAID5D1/D2/D3 and RAID5P (RAID5 PARITY)
>> And it's also confusing compared to RAID6.
>>
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536mnt/out.txt: 65536
>>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
>>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
>>> $ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
>>> adaaa
>>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
>>> bdbbb
>>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
>>> 00000000: 0300 0303 03                             .....
>>>
>>> The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03
>>>
>>> ** Raid6
>>> $ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C
>>> $ sudo mount /dev/loop0 mnt/
>>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
>>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>>         devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>>>         devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>>>         devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>>>         devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY
>>
>> Same like RAID5.
>> IMHO RAID6D1/D2... and RAID6P RAID6Q seems better for me.
>>>
>>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
>>> adaaa
>>>
>>>
>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>> ---
>>>  cmds-inspect.c | 587 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 587 insertions(+)
>>>
>>> diff --git a/cmds-inspect.c b/cmds-inspect.c
>>> index dd7b9dd..fc2e7c3 100644
>>> --- a/cmds-inspect.c
>>> +++ b/cmds-inspect.c
>>> @@ -22,6 +22,11 @@
>>>  #include <errno.h>
>>>  #include <getopt.h>
>>>  #include <limits.h>
>>> +#include <sys/types.h>
>>> +#include <sys/stat.h>
>>> +#include <fcntl.h>
>>> +#include <linux/fs.h>
>>> +#include <linux/fiemap.h>
>>>
>>>  #include "kerncompat.h"
>>>  #include "ioctl.h"
>>> @@ -623,6 +628,586 @@ out:
>>>      return !!ret;
>>>  }
>>>
>>> +
>>> +static const char * const cmd_inspect_physical_find_usage[] = {
>>> +    "btrfs inspect-internal physical-find <path> [<off>|-l <logical>]",
>>> +    "Show the physical placement of a file data.",
>>> +    "<path>     file to show",
>>
>> For resolving logical directly, not the file but any file/dir inside the fs though.
>
> Ok
>>
>>> +    "<off>      file offset to show; 0 if not specified",
>>> +    "<logical>  show info about a logical address instead of a file",
>>
>> Mentioned above, use -o|-s|-l options seems to be a better solution.
>
> ok
>>
>>> +    "This command requires root privileges",
>>> +    NULL
>>> +};
>>> +
>>> +#define STRIPE_INFO_LINEAR        1
>>> +#define STRIPE_INFO_DUP            2
>>> +#define STRIPE_INFO_RAID0        3
>>> +#define STRIPE_INFO_RAID1        4
>>> +#define STRIPE_INFO_RAID10        5
>>> +#define STRIPE_INFO_RAID56_DATA        6
>>> +#define STRIPE_INFO_RAID56_OTHER    7
>>> +#define STRIPE_INFO_RAID56_PARITY    8
>>
>> Mentioned before.
>> And since the STRIPE_INFO_* macro is only used in outputting the string, I prefer to do it in a helper function with if branches.
>
> ok
>
>>
>>> +
>>> +static const char * const stripe_info_descr[] = {
>>> +    [STRIPE_INFO_LINEAR] = "LINEAR",
>>> +    [STRIPE_INFO_DUP] = "DUP",
>>> +    [STRIPE_INFO_RAID0] = "RAID0",
>>> +    [STRIPE_INFO_RAID1] = "RAID1",
>>> +    [STRIPE_INFO_RAID10] = "RAID10",
>>> +    [STRIPE_INFO_RAID56_DATA] = "DATA",
>>> +    [STRIPE_INFO_RAID56_OTHER] = "OTHER",
>>> +    [STRIPE_INFO_RAID56_PARITY] = "PARITY",
>>> +};
>>> +
>>> +struct stripe_info {
>>> +    u64 devid;
>>> +    const char *dname;
>>> +    u64 phy_start;
>>> +    int type;
>>
>> IMHO "dname" contains all the needed info for the role of the stripe.
>> So "type" is useless here for me though.
>
> Sorry I can't understand you: dname is the device name; its role depends on
> several factors, so I also added the type field.

My fault, I got confused and thought dname was just the string form
of type.

What I meant is to replace "int type" with "char *type".

With the new "RAID5D1" / "RAID5D2U" syntax, an int type can't
represent such output.

>
>>
>> And it's better to add a "u32 phy_length" to show how long the stripe is.
>>
>>> +};
>>> +
>>> +static void add_stripe_info(struct stripe_info **list, int *count,
>>> +    u64 devid, const char *dname, u64 phy_start, int type) {
>>> +
>>> +    if (*list == NULL)
>>> +        *count = 0;
>>> +
>>> +    ++*count;
>>> +    *list = realloc(*list, sizeof(struct stripe_info) * *count);
>>> +    /*
>>> +     * It is rude, but it should not happen for this kind of allocation...
>>> +     * ... and anyway when it happens, there are more severe problems
>>> +     * than this handling of "not enough memory"
>>> +     */
>>> +    if (*list == NULL) {
>>> +        error("Not enough memory: abort\n");
>>> +        exit(100);
>>
>> Same exit value problem here.
>
> ok
> [...]
>
>
>>> +
>>> +    } else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
>>> +        /*
>>> +         * RAID1: each chunk has 'num_stripes' copies;
>>> +         * each copy is on a different disk:
>>> +         *
>>> +         *  file: ABC...
>>> +         *
>>> +         *      disk1   disk2   disk3  ....
>>> +         *
>>> +         *        A       A
>>> +         *        B       B
>>> +         *        C       C
>>> +         *
>>> +         */
>>
>> Here btrfs raid1 is more flexible than a normal RAID1 implementation.
>>
>> Better comment would be:
>>  Disk1   Disk2   Disk3
>>   A       A       B
>>   B       C       C
>
> ok
>
>>
>> And that's the real case for 3 disks RAID1 (for same disk size).
>
>
> [...]
>>> +static int cmd_inspect_physical_find(int argc, char **argv)
>>> +{
>>> +    int ret = 0;
> [...]
>>> +
>>> +    check_root_or_exit();
>>> +    check_btrfs_or_exit(fname);
>>
>> If we call get_fs_info(), is it really needed to check btrfs early?
>
> The two above are the main reasons of failure of this command. So I
> preferred to add a clear check about which property we want.
> I think that a statement like:
> 	"You need to be root to execute this command"
> is clearer than a generic EPERM: the user could think that it is
> sufficient to change the permission of the file
>

Makes sense.

This seems to be more of a personal preference then:
report possible errors first, or report an error when it happens.

Although reporting possible errors first means we need to keep the early
check in sync with the real function.

Anyway, I'm OK if you want to keep it.

Thanks,
Qu
>
>>
>> Thanks,
>> Qu
>
> [...]
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-29  1:34       ` Qu Wenruo
@ 2016-07-29  5:08         ` Goffredo Baroncelli
  2016-07-29  6:44           ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-29  5:08 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: dsterba, Chris Mason

On 2016-07-29 03:34, Qu Wenruo wrote:
>> I am not against about your proposal; however I have to point out
>> that the goal of these command was not to *traverse* the file, but
>> only to found the physical location of a file offset. My use case
>> was to simulate a corruption of a raid5 stripe elements: for me it
>> was sufficient to know the page position.
> 
> For corruption case, the best practice would be extending
> btrfs-corrupt-block command.
> 
> And for your original proposal, to locate a page/sector containing
> the bytenr/offset, then the returned value should always be aligned
> to sectorsize. (And we need to state it clear in both man page and
> help string)
> 
> Unfortunately, that's not the case in current implementation. (And
> don't forget future subpage sector size, so in that case, we need to
> check sectorsize first.)
> 
> For example, if user passes a unaligned logical, physical-find will
> return the device offset unaligned.

For the other command (physical-dump), there is a check on the
alignment; the reason was to simplify dumping the content.
However, I don't understand the reason for requiring alignment in the
-find tool as well: why does the output have to be aligned? What is
the difference if I return the address of the first byte of the file
rather than the 2nd or the 3rd (taking into account all the details,
which for raid5/6 is not very easy....)

> 
> If only to locate the stripe/sector, at least returning a aligned
> number seems more reasonable.
> 
> IMHO if we only want a simple tool, then make it clear it's a just
> simple tool, and add limitation and explain to make it simple and
> won't accept any complex/unexpected input.
> 
> Or, make it handle unexpected and complex input well.
> 
> 
> BTW, long time ago, btrfs-map-logical is under the same situation,
> just a simple tool do off-line logical->device offset mapping. But
> since it provides offset/length pair options, it can cause wrong
> or useless result for unaligned input. And we spent some time to
> improve it.
> 
> So I hope we can avoid such problem which has already happened in
> map-logical.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-29  5:08         ` Goffredo Baroncelli
@ 2016-07-29  6:44           ` Qu Wenruo
  2016-07-29 17:14             ` Goffredo Baroncelli
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2016-07-29  6:44 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs; +Cc: dsterba, Chris Mason



On 07/29/2016 01:08 PM, Goffredo Baroncelli wrote:
> On 2016-07-29 03:34, Qu Wenruo wrote:
>>> I am not against about your proposal; however I have to point out
>>> that the goal of these command was not to *traverse* the file, but
>>> only to found the physical location of a file offset. My use case
>>> was to simulate a corruption of a raid5 stripe elements: for me it
>>> was sufficient to know the page position.
>>
>> For corruption case, the best practice would be extending
>> btrfs-corrupt-block command.
>>
>> And for your original proposal, to locate a page/sector containing
>> the bytenr/offset, then the returned value should always be aligned
>> to sectorsize. (And we need to state it clear in both man page and
>> help string)
>>
>> Unfortunately, that's not the case in current implementation. (And
>> don't forget future subpage sector size, so in that case, we need to
>> check sectorsize first.)
>>
>> For example, if user passes a unaligned logical, physical-find will
>> return the device offset unaligned.
>
> For the other command (physical-dump), there is a check about the
> alignment; the reason was to simplify the dump of the content.
> However I don't understand to the reason to ask for the alignment
> even in the -find tool: why the output have to be aligned ? Which is
> the difference if I return the first byte address of the file than the
> 2nd or the 3rd (taking in account all the detail, which for raid5/6
> is not very easy....)

Since it's quite easy for a user to assume such a find tool will dump
info for the range [offset, offset + 4K (or whatever)).
In the unaligned case, the user could get confused about whether the
tool will dump the 4K range, including the next stripe.

Just like map-logical.

So, if you only mean to dump the stripe info which contains the bytenr,
then make the doc clearer about the behavior.

Thanks,
Qu


>
>>
>> If only to locate the stripe/sector, at least returning a aligned
>> number seems more reasonable.
>>
>> IMHO if we only want a simple tool, then make it clear it's a just
>> simple tool, and add limitation and explain to make it simple and
>> won't accept any complex/unexpected input.
>>
>> Or, make it handle unexpected and complex input well.
>>
>>
>> BTW, long time ago, btrfs-map-logical is under the same situation,
>> just a simple tool do off-line logical->device offset mapping. But
>> since it provides offset/length pair options, it can cause wrong
>> or useless result for unaligned input. And we spent some time to
>> improve it.
>>
>> So I hope we can avoid such problem which has already happened in
>> map-logical.
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-29  6:44           ` Qu Wenruo
@ 2016-07-29 17:14             ` Goffredo Baroncelli
  2016-07-30  1:04               ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-29 17:14 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: dsterba, Chris Mason

On 2016-07-29 08:44, Qu Wenruo wrote:
> 
> 
> On 07/29/2016 01:08 PM, Goffredo Baroncelli wrote:
>> On 2016-07-29 03:34, Qu Wenruo wrote:
>>>> I am not against about your proposal; however I have to point
>>>> out that the goal of these command was not to *traverse* the
>>>> file, but only to found the physical location of a file offset.
>>>> My use case was to simulate a corruption of a raid5 stripe
>>>> elements: for me it was sufficient to know the page position.
>>> 
>>> For corruption case, the best practice would be extending 
>>> btrfs-corrupt-block command.
>>> 
>>> And for your original proposal, to locate a page/sector
>>> containing the bytenr/offset, then the returned value should
>>> always be aligned to sectorsize. (And we need to state it clear
>>> in both man page and help string)
>>> 
>>> Unfortunately, that's not the case in current implementation.
>>> (And don't forget future subpage sector size, so in that case, we
>>> need to check sectorsize first.)
>>> 
>>> For example, if user passes a unaligned logical, physical-find
>>> will return the device offset unaligned.
>> 
>> For the other command (physical-dump), there is a check about the 
>> alignment; the reason was to simplify the dump of the content. 
>> However I don't understand to the reason to ask for the alignment 
>> even in the -find tool: why the output have to be aligned ? Which
>> is the difference if I return the first byte address of the file
>> than the 2nd or the 3rd (taking in account all the detail, which
>> for raid5/6 is not very easy....)
> 
> Since it's quite easy for user to assume such find tool will dump
> info for the range [offset, offset + 4K(or whatever)). In that
> unaligned case, user could get confused about if the tool will dump
> the 4K range, including the next stripe.


I am still confused: we are talking about three tools:

1) btrfs insp physical-find
it is definitely not page boundary dependent

2) btrfs insp physical-dump
this implementation is page boundary dependent, and its man page clearly
reports this limit; this constraint might be removed with
further development.

3) a new tool which dumps the physical location of the file contents. It may be 
an extension of 1) or a new development, but at this stage it is too early to talk
about this limit.

am I missing something ?

> 
> Just like map-logical.
> 
> So, if you only mean to dump the stripe info which contains the
> bytenr, then makes the doc more clear about the behavior.
> 
> Thanks, Qu
> 
> 
>> 
>>> 
>>> If only to locate the stripe/sector, at least returning a
>>> aligned number seems more reasonable.
>>> 
>>> IMHO if we only want a simple tool, then make it clear it's a
>>> just simple tool, and add limitation and explain to make it
>>> simple and won't accept any complex/unexpected input.
>>> 
>>> Or, make it handle unexpected and complex input well.
>>> 
>>> 
>>> BTW, long time ago, btrfs-map-logical is under the same
>>> situation, just a simple tool do off-line logical->device offset
>>> mapping. But since it provides offset/length pair
>>> options, it can cause wrong or useless result for unaligned
>>> input. And we spent some time to improve it.
>>> 
>>> So I hope we can avoid such problem which has already happened
>>> in map-logical.
>> 
>> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-29 17:14             ` Goffredo Baroncelli
@ 2016-07-30  1:04               ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2016-07-30  1:04 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs; +Cc: dsterba, Chris Mason



On 07/30/2016 01:14 AM, Goffredo Baroncelli wrote:
> On 2016-07-29 08:44, Qu Wenruo wrote:
>>
>>
>> On 07/29/2016 01:08 PM, Goffredo Baroncelli wrote:
>>> On 2016-07-29 03:34, Qu Wenruo wrote:
>>>>> I am not against about your proposal; however I have to point
>>>>> out that the goal of these command was not to *traverse* the
>>>>> file, but only to found the physical location of a file offset.
>>>>> My use case was to simulate a corruption of a raid5 stripe
>>>>> elements: for me it was sufficient to know the page position.
>>>>
>>>> For corruption case, the best practice would be extending
>>>> btrfs-corrupt-block command.
>>>>
>>>> And for your original proposal, to locate a page/sector
>>>> containing the bytenr/offset, then the returned value should
>>>> always be aligned to sectorsize. (And we need to state it clear
>>>> in both man page and help string)
>>>>
>>>> Unfortunately, that's not the case in current implementation.
>>>> (And don't forget future subpage sector size, so in that case, we
>>>> need to check sectorsize first.)
>>>>
>>>> For example, if user passes a unaligned logical, physical-find
>>>> will return the device offset unaligned.
>>>
>>> For the other command (physical-dump), there is a check about the
>>> alignment; the reason was to simplify the dump of the content.
>>> However I don't understand to the reason to ask for the alignment
>>> even in the -find tool: why the output have to be aligned ? Which
>>> is the difference if I return the first byte address of the file
>>> than the 2nd or the 3rd (taking in account all the detail, which
>>> for raid5/6 is not very easy....)
>>
>> Since it's quite easy for user to assume such find tool will dump
>> info for the range [offset, offset + 4K(or whatever)). In that
>> unaligned case, user could get confused about if the tool will dump
>> the 4K range, including the next stripe.
>
>
> I am still confused: we are talking about three tools:
>
> 1) btrfs insp physical-find
> it is definitely not page boundary dependent

Yes, that's what we are talking about.
But see my later comment.
>
> 2) btrfs insp physical-dump
> this implementation is page boundary dependent; and its man-page clear
> reported this limit; this constraint might be removed with
> further development.

Nothing related to physical-dump, I didn't mention that.

What I am talking about is that "physical-find" without length support
seems useless.
It only ensures that the byte on the device is mapped for the offset the
user specified.

IMHO, since the fs does its work in sector size units, we should also
return the result in sector size units.
Just like what physical-dump is doing.
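
To make the sector-size point concrete, here is a minimal sketch of
rounding an offset down to the start of its sector. The helper name
sector_align_down() is hypothetical (it is not an existing btrfs-progs
function), and 4096 is only an example value; real code would have to
read the sectorsize from the filesystem first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round 'offset' down to the start of the sector that contains it.
 * 'sectorsize' must be a non-zero power of two.
 * Hypothetical helper, for illustration only.
 */
static uint64_t sector_align_down(uint64_t offset, uint64_t sectorsize)
{
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return offset & ~(sectorsize - 1);
}
```

With a 4096 byte sectorsize, an input of 65537 would be reported as
65536, so the returned device offset always points at a sector start.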

Thanks,
Qu
>
> 3) a new tool which dumps the physical location of the file contents. It may be
> an extension of 1) or a new development, but at this stage it is too early to talk
> about this limit.
>
> am I missing something ?
>
>>
>> Just like map-logical.
>>
>> So, if you only mean to dump the stripe info which contains the
>> bytenr, then makes the doc more clear about the behavior.
>>
>> Thanks, Qu
>>
>>
>>>
>>>>
>>>> If only to locate the stripe/sector, at least returning a
>>>> aligned number seems more reasonable.
>>>>
>>>> IMHO if we only want a simple tool, then make it clear it's a
>>>> just simple tool, and add limitation and explain to make it
>>>> simple and won't accept any complex/unexpected input.
>>>>
>>>> Or, make it handle unexpected and complex input well.
>>>>
>>>>
>>>> BTW, long time ago, btrfs-map-logical is under the same
>>>> situation, just a simple tool do off-line logical->device offset
>>>> mapping. But since it provides offset/length pair
>>>> options, it can cause wrong or useless result for unaligned
>>>> input. And we spent some time to improve it.
>>>>
>>>> So I hope we can avoid such problem which has already happened
>>>> in map-logical.
>>>
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
  2016-07-24 11:03 [BTRFS-PROGS][PATCH] " Goffredo Baroncelli
@ 2016-07-24 11:03 ` Goffredo Baroncelli
  0 siblings, 0 replies; 16+ messages in thread
From: Goffredo Baroncelli @ 2016-07-24 11:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Chris Mason, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The aim of this new command is to show the physical placement on the disk
of a file.
Currently it handles all the profiles (single, dup, raid1/10/5/6).

The syntax is simple:

  btrfs inspect-internal physical-find <filename> [<offset>]

where:
  <filename> is the file to inspect
  <offset> is the offset of the file to inspect (default 0)

Below some examples:

** Single

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR
$ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
adaaa

** Dup

The command shows both the copies

$ sudo mkfs.btrfs -f -d dup -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
        devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
$ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
adaaa

** Raid1

The command shows both the copies

$ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
        devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
$ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
adaaa

** Raid10

The command shows both copies; if you set the offset to the next disk-stripe, you can see the next pair of disk-stripes

$ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
adaaa
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
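
The Raid10 behaviour above (offset 65536 lands on the next mirror pair,
at the same position inside the device stripes) can be sketched in
isolation as follows. This is only an illustration that mirrors the
arithmetic used in the patch; struct raid10_pos and raid10_map() are
hypothetical names, and 'chunk_off' is assumed to be the offset
relative to the start of the chunk.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the mapping; both fields are relative to the chunk. */
struct raid10_pos {
	uint64_t first_stripe;  /* index of the first mirror in the stripe array */
	uint64_t stripe_offset; /* byte offset inside each device stripe */
};

/*
 * Map an offset inside a RAID10 chunk to the group of 'sub_stripes'
 * mirrors holding it: entries first_stripe .. first_stripe+sub_stripes-1.
 */
static struct raid10_pos raid10_map(uint64_t chunk_off, uint64_t num_stripes,
				    uint64_t sub_stripes, uint64_t stripe_len)
{
	struct raid10_pos pos;
	uint64_t capacity, row, in_row;

	assert(stripe_len > 0 && sub_stripes > 0);
	assert(num_stripes >= sub_stripes && num_stripes % sub_stripes == 0);

	/* data bytes stored by one full row across all the mirror groups */
	capacity = num_stripes / sub_stripes * stripe_len;
	row = chunk_off / capacity;
	in_row = chunk_off % capacity;

	pos.first_stripe = in_row / stripe_len * sub_stripes;
	pos.stripe_offset = row * stripe_len + chunk_off % stripe_len;
	return pos;
}
```

With 4 devices, 2 sub-stripes and a 64KiB stripe_len, offset 0 maps to
stripe entries 0 and 1, while offset 65536 maps to entries 2 and 3 with
the same stripe_offset, which matches the shape of the output above.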

** Raid5

Depending on the offset, you can see which disk-stripe is used.

$ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
adaaa
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
00000000: 0300 0303 03                             .....

The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03"
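
The same check can be done in C. The following is a minimal sketch for
the two-data-disk RAID5 case only; xor_parity() is a hypothetical
helper written for this example, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the RAID5 parity of two data stripe elements:
 * parity[i] = d1[i] ^ d2[i] for each of the 'len' bytes.
 */
static void xor_parity(const uint8_t *d1, const uint8_t *d2,
		       uint8_t *parity, size_t len)
{
	size_t i;

	assert(d1 != NULL && d2 != NULL && parity != NULL);
	for (i = 0; i < len; i++)
		parity[i] = d1[i] ^ d2[i];
}
```

Feeding it the 5 bytes read from loop1 ("adaaa") and from loop0
("bdbbb") gives 03 00 03 03 03, the same bytes that xxd shows for loop2.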

** Raid6
$ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
adaaa
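
The DATA / OTHER / PARITY labels of the Raid5 and Raid6 examples can be
reproduced with a small sketch like the one below. It follows the same
arithmetic as the patch (slot 'sidx' of row 'stripe_nr' is stored in
stripe entry (sidx + stripe_nr) % num_stripes, and the last slots hold
the parity), but the names enum role, raid56_stripe_index() and
raid56_role() are hypothetical, and 'chunk_off' is assumed to be
relative to the start of the chunk.

```c
#include <assert.h>
#include <stdint.h>

enum role { ROLE_DATA, ROLE_OTHER, ROLE_PARITY };

/* Entry of the chunk stripe array that stores slot 'slot' of row 'stripe_nr'. */
static uint64_t raid56_stripe_index(uint64_t slot, uint64_t stripe_nr,
				    uint64_t num_stripes)
{
	assert(num_stripes > 0 && slot < num_stripes);
	return (slot + stripe_nr) % num_stripes;
}

/*
 * Role of slot 'slot' for the byte at 'chunk_off'.
 * nr_parity is 1 for RAID5 and 2 for RAID6.
 */
static enum role raid56_role(uint64_t slot, uint64_t chunk_off,
			     uint64_t num_stripes, uint64_t nr_parity,
			     uint64_t stripe_len)
{
	uint64_t data_slots, in_row;

	assert(stripe_len > 0 && nr_parity < num_stripes && slot < num_stripes);
	data_slots = num_stripes - nr_parity;
	in_row = chunk_off % (data_slots * stripe_len);

	if (slot >= data_slots)
		return ROLE_PARITY;
	/* the slot covering 'in_row' holds the requested data */
	return (in_row / stripe_len == slot) ? ROLE_DATA : ROLE_OTHER;
}
```

For the 3-disk Raid5 case with a 64KiB stripe_len, slot 0 is DATA at
offset 0 and OTHER at offset 65536, slot 1 is the opposite, and slot 2
is PARITY in both cases, as in the output above.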


Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 cmds-inspect.c | 550 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 550 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..dd0570b 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -22,6 +22,11 @@
 #include <errno.h>
 #include <getopt.h>
 #include <limits.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <linux/fs.h>
+#include <linux/fiemap.h>
 
 #include "kerncompat.h"
 #include "ioctl.h"
@@ -623,6 +628,549 @@ out:
 	return !!ret;
 }
 
+
+static const char * const cmd_inspect_physical_find_usage[] = {
+	"btrfs inspect-internal physical-find <path> [<off>]",
+	"Show the physical placement of a file data.",
+	"<path>   file to show",
+	"<off>    file offset to show; 0 if not specified",
+	"This command requires root privileges",
+	NULL
+};
+
+#define STRIPE_INFO_LINEAR		1
+#define STRIPE_INFO_DUP			2
+#define STRIPE_INFO_RAID0		3
+#define STRIPE_INFO_RAID1		4
+#define STRIPE_INFO_RAID10		5
+#define STRIPE_INFO_RAID56_DATA		6
+#define STRIPE_INFO_RAID56_OTHER	7
+#define STRIPE_INFO_RAID56_PARITY	8
+
+static const char * const stripe_info_descr[] = {
+	[STRIPE_INFO_LINEAR] = "LINEAR",
+	[STRIPE_INFO_DUP] = "DUP",
+	[STRIPE_INFO_RAID0] = "RAID0",
+	[STRIPE_INFO_RAID1] = "RAID1",
+	[STRIPE_INFO_RAID10] = "RAID10",
+	[STRIPE_INFO_RAID56_DATA] = "DATA",
+	[STRIPE_INFO_RAID56_OTHER] = "OTHER",
+	[STRIPE_INFO_RAID56_PARITY] = "PARITY",
+};
+
+struct stripe_info {
+	u64 devid;
+	const char *dname;
+	u64 phy_start;
+	int type;
+};
+
+static void add_stripe_info(struct stripe_info **list, int *count,
+	u64 devid, const char *dname, u64 phy_start, int type) {
+
+	if (*list == NULL)
+		*count = 0;
+
+	++*count;
+	*list = realloc(*list, sizeof(struct stripe_info) * *count);
+	/*
+	 * It is rude, but it should not happen for this kind of allocation...
+	 * ... and anyway when it happens, there are more severe problems
+	 * than this handling of "not enough memory"
+	 */
+	if (*list == NULL) {
+		error("Not enough memory: abort\n");
+		exit(100);
+	}
+
+	(*list)[*count-1].devid = devid;
+	(*list)[*count-1].dname = dname;
+	(*list)[*count-1].phy_start = phy_start;
+	(*list)[*count-1].type = type;
+}
+
+static void dump_stripes(int ndisks, struct btrfs_ioctl_dev_info_args *disks,
+			 struct btrfs_chunk *chunk, u64 logical_start,
+			 struct stripe_info **stripes_ret, int *stripes_count) {
+	struct btrfs_stripe *stripes;
+
+	stripes = &chunk->stripe;
+
+	if ((chunk->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0) {
+		/* LINEAR: each chunk has (should have) only one disk */
+		int j;
+		char *dname = "<NOT FOUND>";
+
+		assert(chunk->num_stripes == 1);
+
+		u64 phy_start = stripes[0].offset +
+			+logical_start;
+		for (j = 0 ; j < ndisks ; j++) {
+			if (stripes[0].devid == disks[j].devid) {
+				dname = (char *)disks[j].path;
+				break;
+			}
+		}
+
+		add_stripe_info(stripes_ret, stripes_count,
+			stripes[0].devid, dname, phy_start,
+			STRIPE_INFO_LINEAR);
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID0) {
+		/*
+		 * RAID0: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABC...NMOP....
+		 *
+		 *      disk1   disk2    disk3  .... disksN
+		 *
+		 *        A      B         C    ....    N
+		 *        M      O         P    ....
+		 *
+		 */
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 disk_stripe_start;
+		int sidx;
+		int j;
+		char *dname = "<NOT FOUND>";
+
+		stripe_capacity = disks_number * disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		sidx = (logical_start / disk_stripe_size) % disks_number;
+
+		u64 phy_start = stripes[sidx].offset +
+			stripe_nr * disk_stripe_size +
+			disk_stripe_start;
+
+		for (j = 0 ; j < ndisks ; j++) {
+			if (stripes[sidx].devid == disks[j].devid) {
+				dname = (char *)disks[j].path;
+				break;
+			}
+		}
+
+		add_stripe_info(stripes_ret, stripes_count,
+			stripes[sidx].devid, dname, phy_start,
+			STRIPE_INFO_RAID0);
+
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) {
+		/*
+		 * RAID1: each chunk has 'num_stripes' copies;
+		 * each copy is on a different disk:
+		 *
+		 *  file: ABC...
+		 *
+		 *      disk1   disk2   disk3  ....
+		 *
+		 *        A       A
+		 *        B       B
+		 *        C       C
+		 *
+		 */
+		int sidx;
+
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_RAID1);
+		}
+
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_DUP) {
+		/*
+		 * DUP: each chunk has 'num_stripes' disk_stripe. Each
+		 * disk_stripe has its own copy of data
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2   disk3
+		 *
+		 *        A
+		 *        B
+		 *        C
+		 *      [...]
+		 *        A
+		 *        B
+		 *        C
+		 *
+		 *
+		 * NOTE: the difference between DUP and RAID1 is that
+		 * in RAID1 each disk_stripe is in a different disk, in DUP
+		 * each disk chunk is in the same disk
+		 */
+		int sidx;
+
+		for (sidx = 0; sidx < chunk->num_stripes; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 phy_start = stripes[sidx].offset +
+				+logical_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_DUP);
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID10) {
+		/*
+		 * RAID10: each chunk is composed by more disks;
+		 * each stripe_len bytes are in a different disk:
+		 *
+		 *  file: ABCD....
+		 *
+		 *      disk1   disk2   disk3   disk4
+		 *
+		 *        A      A         B      B
+		 *        C      C         D      D
+		 *
+		 *
+		 */
+		int i;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 disk_stripe_start;
+
+		stripe_capacity = disks_number * disk_stripe_size / chunk->sub_stripes;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (i = 0; i < chunk->sub_stripes; i++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			int sidx = (i +
+				stripe_start/disk_stripe_size*chunk->sub_stripes) %
+				disks_number;
+
+			u64 phy_start = stripes[sidx].offset +
+				+stripe_nr*disk_stripe_size + disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++) {
+				if (stripes[sidx].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+			}
+
+			add_stripe_info(stripes_ret, stripes_count,
+				stripes[sidx].devid, dname, phy_start,
+				STRIPE_INFO_RAID10);
+
+		}
+	} else if (chunk->type & BTRFS_BLOCK_GROUP_RAID5 ||
+			chunk->type & BTRFS_BLOCK_GROUP_RAID6) {
+		/*
+		 * RAID5: each chunk is spread on a different disk; however one
+		 * disk is used for parity
+		 *
+		 *  file: ABCDEFGHIJKL....
+		 *
+		 *      disk1  disk2  disk3  disk4  disk5
+		 *
+		 *        A      B      C      D      P
+		 *        P      E      F      G      H
+		 *        L      P      I      J      K
+		 *
+		 *   Note: P == parity
+		 *
+		 * RAID6: each chunk is spread on a different disk; however two
+		 * disks are used for parity
+		 *
+		 *  file: ABCDEFGHI...
+		 *
+		 *      disk1  disk2  disk3  disk4  disk5
+		 *
+		 *        A      B      C      P      Q
+		 *        Q      D      E      F      P
+		 *        P      Q      G      H      I
+		 *
+		 *   Note: P,Q == parity
+		 *
+		 */
+		int parities_nr = 1;
+		u64 disks_number = chunk->num_stripes;
+		u64 disk_stripe_size = chunk->stripe_len;
+		u64 stripe_capacity;
+		u64 stripe_nr;
+		u64 stripe_start;
+		u64 pos = 0;
+		u64 disk_stripe_start;
+		int sidx;
+
+		if (chunk->type & BTRFS_BLOCK_GROUP_RAID6)
+			parities_nr = 2;
+
+		stripe_capacity = (disks_number - parities_nr) *
+						disk_stripe_size;
+		stripe_nr = logical_start / stripe_capacity;
+		stripe_start = logical_start % stripe_capacity;
+		disk_stripe_start = logical_start % disk_stripe_size;
+
+		for (sidx = 0; sidx < disks_number ; sidx++) {
+			int j;
+			char *dname = "<NOT FOUND>";
+			u64 stripe_index = (sidx + stripe_nr) % disks_number;
+			u64 phy_start = stripes[stripe_index].offset + /* chunk start */
+				stripe_nr*disk_stripe_size +  /* stripe start */
+				disk_stripe_start;
+
+			for (j = 0 ; j < ndisks ; j++)
+				if (stripes[stripe_index].devid == disks[j].devid) {
+					dname = (char *)disks[j].path;
+					break;
+				}
+
+			if (sidx >= (disks_number - parities_nr)) {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_PARITY);
+				continue;
+			}
+
+			if (stripe_start >= pos && stripe_start < (pos+disk_stripe_size)) {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_DATA);
+			} else {
+				add_stripe_info(stripes_ret, stripes_count,
+					stripes[stripe_index].devid, dname, phy_start,
+					STRIPE_INFO_RAID56_OTHER);
+			}
+
+			pos += disk_stripe_size;
+		}
+		assert(pos == stripe_capacity);
+	} else {
+		error("Unknown chunk type = 0x%016llx\n", chunk->type);
+		return;
+	}
+
+}
+
+static int get_chunk_offset(int fd, u64 logical_start,
+	struct btrfs_chunk *chunk_ret, u64 *off_ret) {
+
+	struct btrfs_ioctl_search_args args;
+	struct btrfs_ioctl_search_key *sk = &args.key;
+	struct btrfs_ioctl_search_header sh;
+	unsigned long off = 0;
+	int i;
+
+	memset(&args, 0, sizeof(args));
+	sk->tree_id = BTRFS_CHUNK_TREE_OBJECTID;
+	sk->min_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->max_objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+	sk->min_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_type = BTRFS_CHUNK_ITEM_KEY;
+	sk->max_offset = (u64)-1;
+	sk->min_offset = 0;
+	sk->max_transid = (u64)-1;
+
+	while (1) {
+		int ret;
+
+		sk->nr_items = 1;
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+		if (ret < 0)
+			return -errno;
+
+		if (sk->nr_items == 0)
+			break;
+
+		off = 0;
+		for (i = 0; i < sk->nr_items; i++) {
+			struct btrfs_chunk *item;
+
+			memcpy(&sh, args.buf + off, sizeof(sh));
+			off += sizeof(sh);
+			item = (struct btrfs_chunk *)(args.buf + off);
+			off += sh.len;
+
+			if (logical_start >= sh.offset &&
+			    logical_start < sh.offset+item->length) {
+				memcpy(chunk_ret, item, sh.len);
+				*off_ret = logical_start-sh.offset;
+				return 0;
+			}
+
+			sk->min_objectid = sh.objectid;
+			sk->min_type = sh.type;
+			sk->min_offset = sh.offset;
+		}
+
+		if (sk->min_offset < (u64)-1)
+			sk->min_offset++;
+		else
+			break;
+	}
+
+	return 1; /* not found */
+}
+
+/*
+ * Inline extents are skipped because they do not take data space,
+ * delalloc and unknown are skipped because we do not know how much
+ * space they will use yet.
+ */
+#define SKIP_FLAGS	(FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC| \
+			 FIEMAP_EXTENT_DATA_INLINE)
+static int cmd_inspect_physical_find(int argc, char **argv)
+{
+	int ret = 0;
+	u64 logical = 0ull;
+	int fd = -1;
+	int last = 0;
+	char buf[16384];
+	char *fname;
+	int found = 0;
+	struct fiemap *fiemap = (struct fiemap *)buf;
+	struct fiemap_extent *fm_ext;
+	const int count = (sizeof(buf) - sizeof(*fiemap)) /
+					sizeof(struct fiemap_extent);
+	struct btrfs_ioctl_dev_info_args *disks = NULL;
+	struct btrfs_ioctl_fs_info_args fi_args = {0};
+	char btrfs_chunk_data[4096];
+	struct btrfs_chunk *chunk_item = (struct btrfs_chunk *)&btrfs_chunk_data;
+	u64 chunk_offset = 0;
+	int minargc = 1;
+	struct stripe_info *stripes = NULL;
+	int stripes_count = 0;
+	int i;
+	int rc;
+
+	memset(fiemap, 0, sizeof(struct fiemap));
+
+	if (check_argc_min(argc - minargc, 1) || check_argc_max(argc - minargc, 2))
+		usage(cmd_inspect_physical_find_usage);
+
+	if (argc - minargc == 2)
+		logical = strtoull(argv[minargc+1], NULL, 0);
+	fname = argv[minargc];
+
+	check_root_or_exit();
+	check_btrfs_or_exit(fname);
+
+	printf("%s: %llu\n", fname, logical);
+
+	fd = open(fname, O_RDONLY);
+	if (fd < 0) {
+		ret = -errno;	/* save errno before error() can clobber it */
+		error("Can't open '%s' for reading\n", fname);
+		goto out;
+	}
+
+	do {
+
+		int rc;
+		int j;
+
+		fiemap->fm_length = ~0ULL;
+		fiemap->fm_extent_count = count;
+		fiemap->fm_flags = FIEMAP_FLAG_SYNC;
+		rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
+		if (rc < 0) {
+			ret = -errno;	/* save errno before error() can clobber it */
+			error("FS_IOC_FIEMAP ioctl failed\n");
+			goto out;
+		}
+
+		for (j = 0; j < fiemap->fm_mapped_extents; j++) {
+			u32 flags;
+
+			fm_ext = &fiemap->fm_extents[j];
+			flags = fm_ext->fe_flags;
+
+			fiemap->fm_start = (fm_ext->fe_logical +
+					fm_ext->fe_length);
+
+			if (flags & FIEMAP_EXTENT_LAST)
+				last = 1;
+
+			if (flags & SKIP_FLAGS)
+				continue;
+
+			if (logical >= fm_ext->fe_logical +
+			    fm_ext->fe_length)
+				continue;
+
+			found = 1;
+			break;
+		}
+	} while (last == 0 && found == 0);
+
+
+	if (!found) {
+		error("Can't find the extent: the file is too short, or the file is stored in a leaf.\n");
+		ret = 10;
+		goto out;
+	}
+
+	rc = get_fs_info(fname, &fi_args, &disks);
+	if (rc < 0) {
+		error("Cannot get info for the filesystem: maybe it is not a btrfs filesystem?\n");
+		ret = 12;
+		goto out;
+	}
+
+	rc = get_chunk_offset(fd,
+		fm_ext->fe_physical + logical - fm_ext->fe_logical,
+		chunk_item, &chunk_offset);
+	if (rc < 0) {
+		error("cannot perform the search: %s\n", strerror(-rc));
+		ret = 13;
+		goto out;
+	}
+	if (rc != 0) {
+		error("cannot find chunk\n");
+		ret = 14;
+		goto out;
+	}
+
+	dump_stripes(fi_args.num_devices, disks,
+		     chunk_item, chunk_offset,
+		     &stripes, &stripes_count);
+
+	for (i = 0 ; i < stripes_count ; i++) {
+		printf("devid: %llu dev_name: %s offset: %llu type: %s\n",
+			stripes[i].devid, stripes[i].dname,
+			stripes[i].phy_start,
+			stripe_info_descr[stripes[i].type]);
+	}
+
+out:
+	if (fd != -1)
+		close(fd);
+	if (disks != NULL)
+		free(disks);
+	if (stripes != NULL)
+		free(stripes);
+	return ret;
+}
+
 static const char inspect_cmd_group_info[] =
 "query various internal information";
 
@@ -644,6 +1192,8 @@ const struct cmd_group inspect_cmd_group = {
 				cmd_inspect_dump_super_usage, NULL, 0 },
 		{ "tree-stats", cmd_inspect_tree_stats,
 				cmd_inspect_tree_stats_usage, NULL, 0 },
+		{ "physical-find", cmd_inspect_physical_find,
+				cmd_inspect_physical_find_usage, NULL, 0 },
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.8.1


Thread overview: 16+ messages
2016-07-27 17:43 [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' Goffredo Baroncelli
2016-07-27 17:43 ` [PATCH 1/5] Add some helper functions Goffredo Baroncelli
2016-07-28  1:03   ` Qu Wenruo
2016-07-27 17:43 ` [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" Goffredo Baroncelli
2016-07-28  1:47   ` Qu Wenruo
2016-07-28 20:25     ` Goffredo Baroncelli
2016-07-29  1:34       ` Qu Wenruo
2016-07-29  5:08         ` Goffredo Baroncelli
2016-07-29  6:44           ` Qu Wenruo
2016-07-29 17:14             ` Goffredo Baroncelli
2016-07-30  1:04               ` Qu Wenruo
2016-07-27 17:43 ` [PATCH 3/5] new command btrfs inspect physical-dump Goffredo Baroncelli
2016-07-27 17:43 ` [PATCH 4/5] Add man page for command btrfs insp physical-find Goffredo Baroncelli
2016-07-27 17:43 ` [PATCH 5/5] Add new command to man pages: btrfs insp physical-dump Goffredo Baroncelli
2016-07-28 12:03 ` [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump' David Sterba
  -- strict thread matches above, loose matches on Subject: below --
2016-07-24 11:03 [BTRFS-PROGS][PATCH] " Goffredo Baroncelli
2016-07-24 11:03 ` [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" Goffredo Baroncelli
