From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([59.151.112.132]:52975 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1750768AbcG2Bec (ORCPT ); Thu, 28 Jul 2016 21:34:32 -0400 Subject: Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find" To: , References: <1469641398-3879-1-git-send-email-kreijack@libero.it> <1469641398-3879-3-git-send-email-kreijack@libero.it> <8b013280-7128-d6e4-2f62-b43851d6f6cd@cn.fujitsu.com> <2d33450d-3805-da54-a5e8-2fefde075957@libero.it> CC: , Chris Mason From: Qu Wenruo Message-ID: Date: Fri, 29 Jul 2016 09:34:12 +0800 MIME-Version: 1.0 In-Reply-To: <2d33450d-3805-da54-a5e8-2fefde075957@libero.it> Content-Type: text/plain; charset="utf-8"; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: Hi, Goffredo, Sorry I forgot to mention that, even btrfs-map-logcal is an offline tool, it can still handle mount fs too. Although it's also true that it still lacks the needed RAID flags and stripe info. At 07/29/2016 04:25 AM, Goffredo Baroncelli wrote: > Hi Qu, > > On 2016-07-28 03:47, Qu Wenruo wrote: >> At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote: >>> From: Goffredo Baroncelli >>> >>> The aim of this new command is to show the physical placement on the disk >>> of a file. >>> Currently it handles all the profiles (single, dup, raid1/10/5/6). >>> >>> The syntax is simple: >> >> Uh... >> Where is the synatx? > > :-) > > The syntax is: > > btrfs inspect-internal physical-find [-l |] > >> >> I guess the synatx is >> physical-find [] >> >>> >>> where: >>> is the file to inspect >>> is the offset of the file to inspect (default 0) >> >> Normally is paired with . >> What about add a new optional parameter ? > > See my next comment > >> Its default value would be the length of the file. >> >> And for the optional , would you mind to make it as an option? >> like -o|--offset and -s|--size ? >> >> For resolve logical directly, then -l|--logical . >> >>> >>> Below some examples: >>> >>> ** Single >>> >>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt >>> mnt/out.txt: 0 >> >> So 0 is the offset inside the file. >> And how long that file extent is? >> >>> devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR >> >> LINEAR seems a little different, as normally we just call it SINGLE in btrfs. > > Right > >> >> And what about changing the output format to the following? >> (This combines both fiemap style and map-logical style) >> ------ >> EXT: FILE-OFFSET LOGICAL RANGE DEVICE DEVICE RANG TYPE >> 0: 0-128K XXXXX-XXXXX 1:/dev/loop0 XXXXX-XXXXX RAID1 >> 2:/dev/loop1 XXXXX-XXXXX RAID1 >> 1: 128K-256K XXXXX-XXXXX 1:/dev/loop2 XXXXX-XXXXX RAID5D1 >> 1:/dev/loop3 XXXXX-XXXXX RAID5D2 >> 1:/dev/loop4 XXXXX-XXXXX RAID5P >> XXXXX-XXXXX 1:/dev/loop2 XXXXX-XXXXX RAID5D1 >> 1:/dev/loop3 XXXXX-XXXXX RAID5D2 >> 1:/dev/loop4 XXXXX-XXXXX RAID5P >> ------ >> Extent 0 and 1 are in different raid profile, it's only possible during convert, just used as an exmple >> And Extent 1 are crossing 2 RAID5 stripes, so needs 2 logical range to show them all. > > This is "quite clear" from an human point of view. But is a nightmare for a script to parse.. And what is missing is > something like "RAID5U" (U==unrelated) for element of the stripe but not of the file Errr, right. It's not friendly for script at all. But not that hard to fix. ------ EXT: FILE-OFFSET LOGICAL RANGE DEVICE DEVICE RANG TYPE 0: 0-131071 XXXXX-XXXXX 1:/dev/loop0 XXXXX-XXXXX RAID1 0: 0-131071 XXXXX-XXXXX 2:/dev/loop1 XXXXX-XXXXX RAID1 1: 131072-196608 XXXXX-XXXXX 3:/dev/loop2 XXXXX-XXXXX RAID5D1 1: 131072-196608 XXXXX-XXXXX 4:/dev/loop3 XXXXX-XXXXX RAID5D2 1: 131072-196608 XXXXX-XXXXX 5:/dev/loop4 XXXXX-XXXXX RAID5P 1: 131072-196608 XXXXX-XXXXX 3:/dev/loop2 XXXXX-XXXXX RAID5D1 1: 131072-196608 XXXXX-XXXXX 4:/dev/loop3 XXXXX-XXXXX RAID5D2 1: 131072-196608 XXXXX-XXXXX 5:/dev/loop4 XXXXX-XXXXX RAID5P ------ Just pend all the extent number,file offset, logical range. And for unrelated data stripe, add a "U" suffix would be good enough. > > >> >> Although it's quite hard to put the above output into 80 characters per line, it provides almost every info we need: >> 1) File offset and its length >> 2) Logical bytenr and its length >> 3) Device bytenr and its length (since its length can differ from logical length) >> 4) RAID type and its role. > > I am not against about your proposal; however I have to point out that the goal of these command was not to *traverse* the file, but only to found the physical location of a file offset. My use case was to simulate a corruption of a raid5 stripe elements: for me it was sufficient to know the page position. For corruption case, the best practice would be extending btrfs-corrupt-block command. And for your original proposal, to locate a page/sector containing the bytenr/offset, then the returned value should always be aligned to sectorsize. (And we need to state it clear in both man page and help string) Unfortunately, that's not the case in current implementation. (And don't forget future subpage sector size, so in that case, we need to check sectorsize first.) For example, if user passes a unaligned logical, physical-find will return the device offset unaligned. If only to locate the stripe/sector, at least returning a aligned number seems more reasonable. IMHO if we only want a simple tool, then make it clear it's a just simple tool, and add limitation and explain to make it simple and won't accept any complext/unexpected input. Or, make it handle unexpected and complex input well. BTW, long time ago, btrfs-map-logical is under the same situation, just a simple tool do off-line logical->device offset mapping. But it since it does provides offset/length pair options, it can cause wrong or uesless result for unaligned input. And we spent some time to improve it. So I hope we can avoid such problem which has already happened in map-logical. > > If you want these information to automate a test, I think that the range concept is more a problem than an help. > > I suggest to add a third command (btrfs insp ranges ?) which show what are you looking. > >> >> >>> $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo >>> adaaa >>> >>> ** Dup >>> >>> The command shows both the copies >>> >>> $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt >>> mnt/out.txt: 0 >>> devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP >>> devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP >>> $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo >>> adaaa >>> >>> ** Raid1 >>> >>> The command shows both the copies >>> >>> $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1 >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 >>> devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1 >>> devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1 >>> $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo >>> adaaa >>> >>> ** Raid10 >>> >>> The command show both the copies; if you set an offset to the next disk-stripe, you can see the next pair of disk-stripe >>> >>> $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123] >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 >>> devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10 >>> devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10 >>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo >>> adaaa >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536 >>> mnt/out.txt: 65536 >>> devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10 >>> devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10 >>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo >>> bdbbb >>> >>> ** Raid5 >>> >>> Depending by the offset, you can see which disk-stripe is used. >>> >>> $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012] >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt >>> mnt/out.txt: 0 >>> devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA >>> devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER >>> devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY >> >> Here DATA/OTHER is a little confusing. >> For 4 disks raid5, will it be DATA/OTHER/OTHER and PARITY? >> >> What about RAID5D1 for the first data stripe and RAID5D2 for the second? > > And what about a data-stripe which is not related to the file which we are examining ? Adding a "U" suffix if you like. Or some other character like "*"? BTW, if following this syntax, documentation is also important to explain such suffix. > > > > >> >> So for 4 disks raid5, it will be RAID5D1/D2/D3 and RAID5P (RAID5 PARITY) >> And it's also confusing compared to RAID6. >> >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536mnt/out.txt: 65536 >>> devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER >>> devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA >>> devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY >>> $ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo >>> adaaa >>> $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo >>> bdbbb >>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd >>> 00000000: 0300 0303 03 ..... >>> >>> The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03 >>> >>> ** Raid6 >>> $ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C >>> $ sudo mount /dev/loop0 mnt/ >>> $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null >>> $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt >>> mnt/out.txt: 0 >>> devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA >>> devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER >>> devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY >>> devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY >> >> Same like RAID5. >> IMHO RAID6D1/D2... and RAID6P RAID6Q seems better for me. >>> >>> $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo >>> adaaa >>> >>> >>> Signed-off-by: Goffredo Baroncelli >>> --- >>> cmds-inspect.c | 587 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 587 insertions(+) >>> >>> diff --git a/cmds-inspect.c b/cmds-inspect.c >>> index dd7b9dd..fc2e7c3 100644 >>> --- a/cmds-inspect.c >>> +++ b/cmds-inspect.c >>> @@ -22,6 +22,11 @@ >>> #include >>> #include >>> #include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> >>> #include "kerncompat.h" >>> #include "ioctl.h" >>> @@ -623,6 +628,586 @@ out: >>> return !!ret; >>> } >>> >>> + >>> +static const char * const cmd_inspect_physical_find_usage[] = { >>> + "btrfs inspect-internal physical-find [|-l ]", >>> + "Show the physical placement of a file data.", >>> + " file to show", >> >> For resolving logical directly, not the file but any file/dir inside the fs though. > > Ok >> >>> + " file offset to show; 0 if not specified", >>> + " show info about a logical address instead of a file", >> >> Mentioned above, use -o|-s|-l options seems to be a better solution. > > ok >> >>> + "This command requires root privileges", >>> + NULL >>> +}; >>> + >>> +#define STRIPE_INFO_LINEAR 1 >>> +#define STRIPE_INFO_DUP 2 >>> +#define STRIPE_INFO_RAID0 3 >>> +#define STRIPE_INFO_RAID1 4 >>> +#define STRIPE_INFO_RAID10 5 >>> +#define STRIPE_INFO_RAID56_DATA 6 >>> +#define STRIPE_INFO_RAID56_OTHER 7 >>> +#define STRIPE_INFO_RAID56_PARITY 8 >> >> Mentioned before. >> And since the STRIPE_INFO_* macro is only used in outputting the string, I prefer to do it in a helper function with if branches. > > ok > >> >>> + >>> +static const char * const stripe_info_descr[] = { >>> + [STRIPE_INFO_LINEAR] = "LINEAR", >>> + [STRIPE_INFO_DUP] = "DUP", >>> + [STRIPE_INFO_RAID0] = "RAID0", >>> + [STRIPE_INFO_RAID1] = "RAID1", >>> + [STRIPE_INFO_RAID10] = "RAID10", >>> + [STRIPE_INFO_RAID56_DATA] = "DATA", >>> + [STRIPE_INFO_RAID56_OTHER] = "OTHER", >>> + [STRIPE_INFO_RAID56_PARITY] = "PARITY", >>> +}; >>> + >>> +struct stripe_info { >>> + u64 devid; >>> + const char *dname; >>> + u64 phy_start; >>> + int type; >> >> IMHO "dname" contains all the neede info for the role of the stripe. >> So "type" is useless here for me though. > > Sorry I can't understand you: dname is the device name; its role depends by > several factors, so I add also the type field. My fault, I just got confused and though dname is just the string output of type. And I mean to replace "int type" with "char *type". Since if following the new "RAID5D1" "RAID5D2U" syntax, type can't handle such output. > >> >> And it's better to add a "u32 phy_length" to show how long the stripe is. >> >>> +}; >>> + >>> +static void add_stripe_info(struct stripe_info **list, int *count, >>> + u64 devid, const char *dname, u64 phy_start, int type) { >>> + >>> + if (*list == NULL) >>> + *count = 0; >>> + >>> + ++*count; >>> + *list = realloc(*list, sizeof(struct stripe_info) * *count); >>> + /* >>> + * It is rude, but it should not happen for this kind of allocation... >>> + * ... and anyway when it happens, there are more severe problems >>> + * that this handling of "not enough memory" >>> + */ >>> + if (*list == NULL) { >>> + error("Not nough memory: abort\n"); >>> + exit(100); >> >> Same exit value problem here. > > ok > [...] > > >>> + >>> + } else if (chunk->type & BTRFS_BLOCK_GROUP_RAID1) { >>> + /* >>> + * RAID0: each chunk is composed by more disks; >>> + * each stripe_len bytes are in a different disk: >>> + * >>> + * file: ABC... >>> + * >>> + * disk1 disk2 disk3 .... >>> + * >>> + * A A >>> + * B B >>> + * C C >>> + * >>> + */ >> >> Here btrfs raid1 is more flex than normal RAID1 implement. >> >> Better comment would be: >> Disk1 Disk2 Disk3 >> A A B >> B C C > > ok > >> >> And that's the real case for 3 disks RAID1 (for same disk size). > > > [...] >>> +static int cmd_inspect_physical_find(int argc, char **argv) >>> +{ >>> + int ret = 0; > [...] >>> + >>> + check_root_or_exit(); >>> + check_btrfs_or_exit(fname); >> >> If we call get_fs_info(), is it really needed to check btrfs early? > > The two above are the mains reasons of failure of these command. So I > preferred to add a clear check about which property we want. > I think that is more clear a statemenmt like: > "You need to be root to execute this command" > instead of a generic EPERM: the user could think that it is > sufficent to change the permission of the file > Makes sense. This seems to be a more personal preference then. Report possible error first or report error when it happens. Although possible report error first means we need to keep the early check the same as the real function. Any way, I'm OK if you want to keep it. Thanks, Qu > >> >> Thanks, >> Qu > > [...] >