* User space RAID-6 access @ 2011-01-31 20:20 Piergiorgio Sartor 2011-01-31 20:52 ` NeilBrown 0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-01-31 20:20 UTC (permalink / raw)
To: linux-raid

Hi all,

some time ago it was mentioned, I think by Neil, that it is possible
to consistently access a RAID-6 array from user space, in order to be
able to perform some checks, like the notorious "which HDD has wrong
data".

Is there any reference, documentation or source code which can be
taken as an example for such a case?

Thanks, bye,

-- 
piergiorgio

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-01-31 20:20 User space RAID-6 access Piergiorgio Sartor @ 2011-01-31 20:52 ` NeilBrown 2011-02-01 19:21 ` Piergiorgio Sartor 2011-02-05 17:33 ` Piergiorgio Sartor 0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2011-01-31 20:52 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> Hi all,
>
> some time ago it was mentioned, I think by Neil, that it is possible
> to consistently access a RAID-6 array from user space, in order to be
> able to perform some checks, like the notorious "which HDD has wrong
> data".
>
> Is there any reference, documentation or source code which can be
> taken as an example for such a case?

Look in the mdadm source code, particularly at restripe.c

Also,
   make test_stripe
builds a program that the test suite uses to verify data correctness.

That should give you enough hints to get you started.

NeilBrown

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-01-31 20:52 ` NeilBrown @ 2011-02-01 19:21 ` Piergiorgio Sartor 2011-02-01 20:14 ` John Robinson 2011-02-01 20:18 ` NeilBrown 2011-02-05 17:33 ` Piergiorgio Sartor 1 sibling, 2 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-01 19:21 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
>
> > Hi all,
> >
> > some time ago it was mentioned, I think by Neil, that it is possible
> > to consistently access a RAID-6 array from user space, in order to be
> > able to perform some checks, like the notorious "which HDD has wrong
> > data".
> >
> > Is there any reference, documentation or source code which can be
> > taken as an example for such a case?
>
> Look in the mdadm source code, particularly at restripe.c
>
> Also,
>    make test_stripe
> builds a program that the test suite uses to verify data correctness.
>
> That should give you enough hints to get you started.

Hi Neil, thanks for the pointer.

I had a look at the code, and there is something I did not get.

It seems to me the HDDs composing the array are "simply" opened
with "open".

In case the array is in use, how is a race condition avoided between
the "test_stripe" program and the md device?

I mean, the first could start to read from the HDDs while the other
writes something in the same place, leading to an inconsistent parity
from the "test_stripe" point of view, since it missed some update.

Or is there locking inside md? (In "test_stripe" I could not see any,
and it would also be dangerous.)

Thanks again, bye,

-- 
piergiorgio

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-02-01 19:21 ` Piergiorgio Sartor @ 2011-02-01 20:14 ` John Robinson 2011-02-01 20:18 ` NeilBrown 1 sibling, 0 replies; 14+ messages in thread From: John Robinson @ 2011-02-01 20:14 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: NeilBrown, linux-raid On 01/02/2011 19:21, Piergiorgio Sartor wrote: > On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote: >> On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor >> <piergiorgio.sartor@nexgo.de> wrote: >> >>> Hi all, >>> >>> some times ago, I think was Neil, it was mentioned >>> about the possibility to access consistently a RAID-6 >>> array from user space, in order to be able to perform >>> some checks, like the notorius "which HDD has wrong data". >>> >>> Is there any reference or documentation or source code >>> which can be taken as example for such a case? >>> >> >> Look in the mdadm source code, particularly at restripe.c >> >> Also >> make test_stripe >> >> make a program the the test suite uses for verify data correctness. >> >> That should give you enough hints to get you started. > > > Hi Neil, thanks for the pointer. > > I had a look at the code and there is something I did not get. > > It seems to me the HDDs composing the array are "simply" > opened with "open". > > In case the array is in use, how is avoided a race condition > between the "test_stripe" program and the md device? > > I mean, the first could start to read from the HDDs and the > other could write something in the same place, leading to an > inconsistent parity, from the "test_stripe" point of view, > since it missed some update. > > Or there is a locking inside md (in "test_stripe" I could > not see and it would also be dangerous)? Look at Grow.c and suspend_lo and suspend_hi. Just my €$£0.005, I may be wrong, etc. Cheers, John. 
-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-02-01 19:21 ` Piergiorgio Sartor 2011-02-01 20:14 ` John Robinson @ 2011-02-01 20:18 ` NeilBrown 2011-02-01 21:00 ` Piergiorgio Sartor 1 sibling, 1 reply; 14+ messages in thread From: NeilBrown @ 2011-02-01 20:18 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid On Tue, 1 Feb 2011 20:21:54 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote: > > On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor > > <piergiorgio.sartor@nexgo.de> wrote: > > > > > Hi all, > > > > > > some times ago, I think was Neil, it was mentioned > > > about the possibility to access consistently a RAID-6 > > > array from user space, in order to be able to perform > > > some checks, like the notorius "which HDD has wrong data". > > > > > > Is there any reference or documentation or source code > > > which can be taken as example for such a case? > > > > > > > Look in the mdadm source code, particularly at restripe.c > > > > Also > > make test_stripe > > > > make a program the the test suite uses for verify data correctness. > > > > That should give you enough hints to get you started. > > > Hi Neil, thanks for the pointer. > > I had a look at the code and there is something I did not get. > > It seems to me the HDDs composing the array are "simply" > opened with "open". > > In case the array is in use, how is avoided a race condition > between the "test_stripe" program and the md device? > > I mean, the first could start to read from the HDDs and the > other could write something in the same place, leading to an > inconsistent parity, from the "test_stripe" point of view, > since it missed some update. > > Or there is a locking inside md (in "test_stripe" I could > not see and it would also be dangerous)? > > Thanks again, > > bye, > I didn't realise that you wanted to look at the members of the array while the array was active!! That is a bit harder. But not impossible. 
If you write a couple of numbers to 'suspend_lo' and 'suspend_hi' in sysfs, then writes to blocks between those two array addresses will be blocked. So you could suspend a region, look at the blocks, then un-suspend. NeilBrown ^ permalink raw reply [flat|nested] 14+ messages in thread
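Neil's suspend_lo/suspend_hi recipe can be sketched as a small helper. This is an illustration, not mdadm code: the sysfs directory (normally something like /sys/block/mdX/md) is passed in as a parameter so the helper can be tried against any directory, and the exact units and resume semantics of the two attributes should be double-checked against the kernel's md documentation.

```c
#include <stdio.h>

/* Write one numeric value to an md sysfs attribute under 'mddir'.
 * On a real system mddir would be e.g. "/sys/block/md111/md". */
static int write_md_attr(const char *mddir, const char *attr,
                         unsigned long long val)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path), "%s/%s", mddir, attr);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%llu\n", val);
    return fclose(f);
}

/* Block writes to array addresses in [lo, hi), as Neil describes:
 * set the window, inspect the member devices, then shrink the
 * window back to empty to let writes proceed again. */
static int suspend_region(const char *mddir,
                          unsigned long long lo, unsigned long long hi)
{
    if (write_md_attr(mddir, "suspend_lo", lo) != 0)
        return -1;
    return write_md_attr(mddir, "suspend_hi", hi);
}
```

A consistency checker would then call suspend_region() before reading a stripe from the raw components, and suspend_region(mddir, 0, 0) afterwards to lift the suspension (an empty range suspends nothing).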
* Re: User space RAID-6 access 2011-02-01 20:18 ` NeilBrown @ 2011-02-01 21:00 ` Piergiorgio Sartor 0 siblings, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-01 21:00 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> I didn't realise that you wanted to look at the members of the array while
> the array was active!! That is a bit harder. But not impossible.

Well, let's say that would be optimal, but even off-line could be OK.

BTW, what does "off-line" mean? Does it mean that /dev/mdX exists but
is not in use (un-mounted), or that it is just read-only, or that it
should *not* exist and only the component HDDs are available?

> If you write a couple of numbers to 'suspend_lo' and 'suspend_hi' in sysfs,
> then writes to blocks between those two array addresses will be blocked.
> So you could suspend a region, look at the blocks, then un-suspend.

Oh, thanks, I will consider the option.

bye,

-- 
piergiorgio

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-01-31 20:52 ` NeilBrown 2011-02-01 19:21 ` Piergiorgio Sartor @ 2011-02-05 17:33 ` Piergiorgio Sartor 2011-02-05 20:58 ` NeilBrown 1 sibling, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-05 17:33 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> Look in the mdadm source code, particularly at restripe.c
>
> Also,
>    make test_stripe
> builds a program that the test suite uses to verify data correctness.
>
> That should give you enough hints to get you started.

Hi Neil,

I'm trying to use the "test_stripe" binary, just to confirm the
operation, in test mode (no save nor restore).

Unfortunately, it does not seem to work properly in my hands.

I created a 4-disk RAID-6 (/dev/loop[0-3]) with:

mdadm -C /dev/md111 -l 6 -n 4 --chunk=64 /dev/loop[0-3]

Filled the array from urandom:

dd if=/dev/urandom of=/dev/md111

And tried:

./test_stripe test file_test.raw 4 65536 6 2 65536 $[ 65536 * 3 ] /dev/loop[0-3]

This returns:

0->0
1->1
P(2) wrong at 1
Q(3) wrong at 1
0->3
1->0
P(1) wrong at 2
Q(2) wrong at 2
0->2
1->3
P(0) wrong at 3
Q(1) wrong at 3

An array filled with "0" does not return any errors.

Am I missing something, or does the code have problems?

Another question: I noticed the code uses an array of "offsets" which
seems to be filled with "0" and never changed. Is this really intended?
Is the offset information the one found per component using
"mdadm -E ..." or /sys/class/block/mdX/md/rdX/offset?
What is the relation with the "start" parameter of "test_stripe"?

Thanks in advance for any hints or suggestions.
The md device is:

/dev/md111:
        Version : 1.2
  Creation Time : Sat Feb  5 17:48:49 2011
     Raid Level : raid6
     Array Size : 524160 (511.96 MiB 536.74 MB)
  Used Dev Size : 262080 (255.98 MiB 268.37 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat Feb  5 18:19:45 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : lazy.lzy:111 (local to host lazy.lzy)
           UUID : b794fb80:c4006853:47761570:3d97a1d2
         Events : 42

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3

bye,

-- 
piergiorgio

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: User space RAID-6 access 2011-02-05 17:33 ` Piergiorgio Sartor @ 2011-02-05 20:58 ` NeilBrown 2011-02-07 22:24 ` [PATCH] " Piergiorgio Sartor 0 siblings, 1 reply; 14+ messages in thread From: NeilBrown @ 2011-02-05 20:58 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid On Sat, 5 Feb 2011 18:33:34 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > > Look in the mdadm source code, particularly at restripe.c > > > > Also > > make test_stripe > > > > make a program the the test suite uses for verify data correctness. > > > > That should give you enough hints to get you started. > > Hi Neil, > > I'm trying to use the "test_stripe" binary, just to > confirm the operation, in test mode (no save nor restore). > > Unfortunately, it seems not work properly, in my hands. > > I created a 4 disks RAID-6 (/dev/loop[0-3]), with: > > mdadm -C /dev/md111 -l 6 -n 4 --chunk=64 /dev/loop[0-3] > > Filled the array from urandom: > > dd if=/dev/urandom of=/dev/md111 > > And tried: > > ./test_stripe test file_test.raw 4 65536 6 2 65536 $[ 65536 * 3 ] /dev/loop[0-3] > > This returns: > > 0->0 > 1->1 > P(2) wrong at 1 > Q(3) wrong at 1 > 0->3 > 1->0 > P(1) wrong at 2 > Q(2) wrong at 2 > 0->2 > 1->3 > P(0) wrong at 3 > Q(1) wrong at 3 > > The array filled with "0" does not return any error. > > Am'I missing something or the code has problems? > > An other question, I noticed the code uses an array of > "offsets" which seem to be filled with "0" and never > changed. > Is this really wanted? > Is the offset information the one found per component > using "mdadm -E ..." or /sys/class/block/mdX/md/rdX/offset? > What's the relation with the "start" parameter of "test_stripe"? > > Thanks in advance for any hints or suggestions. test_stripe assumes that the data starts at the start of each device. 
As you are using 1.2 metadata (the default), data starts about 1M into
the device (I think - you can check with --examine).

You could fix test_stripe to put the right value in the 'offsets' array,
or you could create the array with 1.0 or 0.90 metadata.

NeilBrown

> The md device is:
>
> /dev/md111:
>         Version : 1.2
>   Creation Time : Sat Feb  5 17:48:49 2011
>      Raid Level : raid6
>      Array Size : 524160 (511.96 MiB 536.74 MB)
>   Used Dev Size : 262080 (255.98 MiB 268.37 MB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>     Update Time : Sat Feb  5 18:19:45 2011
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : lazy.lzy:111 (local to host lazy.lzy)
>            UUID : b794fb80:c4006853:47761570:3d97a1d2
>          Events : 42
>
>     Number   Major   Minor   RaidDevice State
>        0       7        0        0      active sync   /dev/loop0
>        1       7        1        1      active sync   /dev/loop1
>        2       7        2        2      active sync   /dev/loop2
>        3       7        3        3      active sync   /dev/loop3
>
> bye,

^ permalink raw reply [flat|nested] 14+ messages in thread
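Neil's point about the offsets array can be illustrated with a hypothetical helper (the function names here are invented for the sketch): the "Data Offset" that mdadm --examine reports is in 512-byte sectors, so a 1.2-metadata member with a 2048-sector data offset has its data starting 1 MiB into the device, and a checker must seek past that before reading the first chunk.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Convert a "Data Offset" reported in 512-byte sectors to the byte
 * value that restripe.c's offsets[] array would need to hold. */
static off_t data_offset_bytes(unsigned long long sectors)
{
    return (off_t)sectors * 512;
}

/* Position a member-device fd at the start of the data area, i.e.
 * past the 1.x superblock region. */
static off_t seek_to_data(int fd, unsigned long long data_offset_sectors)
{
    return lseek(fd, data_offset_bytes(data_offset_sectors), SEEK_SET);
}
```

With the common 1.2-metadata default of 2048 sectors, data_offset_bytes(2048) is 1048576, i.e. the "about 1M" Neil mentions.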
* [PATCH] Re: User space RAID-6 access 2011-02-05 20:58 ` NeilBrown @ 2011-02-07 22:24 ` Piergiorgio Sartor 2011-02-07 22:49 ` NeilBrown 0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-07 22:24 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> test_stripe assumes that the data starts at the start of each device.
> As you are using 1.2 metadata (the default), data starts about 1M into
> the device (I think - you can check with --examine).
>
> You could fix test_stripe to put the right value in the 'offsets' array,
> or you could create the array with 1.0 or 0.90 metadata.

Hi Neil,

thanks for the info; maybe that should be a second patch.

In the meantime, please find attached a patch against restripe.c of
mdadm 3.2 (the latest, I hope).

It adds the functionality to detect, in RAID-6, which of the disks
potentially has problems, in case of parity errors.

Some checks take place in order to avoid false positives; I hope these
are correct and sufficient.

I'm not 100% happy with the interface (too much redundancy), but for
the time being it should be OK.

Of course, any improvement is welcome.

Please consider including these changes in the next mdadm release,
whichever that may be.
bye, --- restripe.c.org 2011-01-13 05:52:15.000000000 +0100 +++ restripe.c 2011-02-07 22:51:01.539471472 +0100 @@ -285,10 +285,13 @@ uint8_t raid6_gfexp[256]; uint8_t raid6_gfinv[256]; uint8_t raid6_gfexi[256]; +uint8_t raid6_gflog[256]; +uint8_t raid6_gfilog[256]; void make_tables(void) { int i, j; uint8_t v; + uint32_t b, log; /* Compute multiplication table */ for (i = 0; i < 256; i++) @@ -312,6 +315,19 @@ for (i = 0; i < 256; i ++) raid6_gfexi[i] = raid6_gfinv[raid6_gfexp[i] ^ 1]; + /* Compute log and inverse log */ + /* Modified code from: http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.html */ + b = 1; + raid6_gflog[0] = 0; + raid6_gfilog[255] = 0; + + for (log = 0; log < 255; log++) { + raid6_gflog[b] = (uint8_t) log; + raid6_gfilog[log] = (uint8_t) b; + b = b << 1; + if (b & 256) b = b ^ 0435; + } + tables_ready = 1; } @@ -387,6 +403,65 @@ } } +/* Try to find out if a specific disk has a problem */ +int raid6_check_disks(int data_disks, int start, int chunk_size, int level, int layout, int diskP, int diskQ, char *p, char *q, char **stripes) +{ + int i; + int data_id, diskD; + uint8_t Px, Qx; + int curr_broken_disk = -1; + int prev_broken_disk = -1; + int broken_status = 0; + + for(i = 0; i < chunk_size; i++) { + Px = (uint8_t)stripes[diskP][i] ^ (uint8_t)p[i]; + Qx = (uint8_t)stripes[diskQ][i] ^ (uint8_t)q[i]; + + if((Px != 0) && (Qx == 0)) { + curr_broken_disk = diskP; + } + + if((Px == 0) && (Qx != 0)) { + curr_broken_disk = diskQ; + } + + if((Px != 0) && (Qx != 0)) { + data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF; + diskD = geo_map(data_id, start/chunk_size, data_disks + 2, level, layout); + curr_broken_disk = diskD; + } + + if((Px == 0) && (Qx == 0)) { + curr_broken_disk = curr_broken_disk; + } + + switch(broken_status) { + case 0: + if(curr_broken_disk != -1) { + prev_broken_disk = curr_broken_disk; + broken_status = 1; + } + break; + + case 1: + if(curr_broken_disk != prev_broken_disk) { + broken_status = 2; + } + if(curr_broken_disk >= 
data_disks + 2) { + broken_status = 2; + } + break; + + case 2: + default: + curr_broken_disk = prev_broken_disk = -2; + break; + } + } + + return curr_broken_disk; +} + /* Save data: * We are given: * A list of 'fds' of the active disks. Some may be absent. @@ -673,7 +748,12 @@ char *q = malloc(chunk_size); int i; + int diskP, diskQ; int data_disks = raid_disks - (level == 5 ? 1: 2); + + if (!tables_ready) + make_tables(); + for ( i = 0 ; i < raid_disks ; i++) stripes[i] = stripe_buf + i * chunk_size; @@ -693,18 +773,25 @@ switch(level) { case 6: qsyndrome(p, q, (uint8_t**)blocks, data_disks, chunk_size); - disk = geo_map(-1, start/chunk_size, raid_disks, + diskP = geo_map(-1, start/chunk_size, raid_disks, level, layout); - if (memcmp(p, stripes[disk], chunk_size) != 0) { - printf("P(%d) wrong at %llu\n", disk, + if (memcmp(p, stripes[diskP], chunk_size) != 0) { + printf("P(%d) wrong at %llu\n", diskP, start / chunk_size); } - disk = geo_map(-2, start/chunk_size, raid_disks, + diskQ = geo_map(-2, start/chunk_size, raid_disks, level, layout); - if (memcmp(q, stripes[disk], chunk_size) != 0) { - printf("Q(%d) wrong at %llu\n", disk, + if (memcmp(q, stripes[diskQ], chunk_size) != 0) { + printf("Q(%d) wrong at %llu\n", diskQ, start / chunk_size); } + disk = raid6_check_disks(data_disks, start, chunk_size, level, layout, diskP, diskQ, p, q, stripes); + if(disk >= 0) { + printf("Possible failed disk: %d\n", disk); + } + if(disk == -2) { + printf("Failure detected, but disk unknown\n"); + } break; } length -= chunk_size; -- piergiorgio ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Re: User space RAID-6 access 2011-02-07 22:24 ` [PATCH] " Piergiorgio Sartor @ 2011-02-07 22:49 ` NeilBrown 2011-02-09 18:47 ` Piergiorgio Sartor 0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2011-02-07 22:49 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, 7 Feb 2011 23:24:59 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> > test_stripe assumes that the data starts at the start of each device.
> > As you are using 1.2 metadata (the default), data starts about 1M into
> > the device (I think - you can check with --examine).
> >
> > You could fix test_stripe to put the right value in the 'offsets' array,
> > or you could create the array with 1.0 or 0.90 metadata.
>
> Hi Neil,
>
> thanks for the info; maybe that should be a second patch.
>
> In the meantime, please find attached a patch against restripe.c of
> mdadm 3.2 (the latest, I hope).
>
> It adds the functionality to detect, in RAID-6, which of the disks
> potentially has problems, in case of parity errors.
>
> Some checks take place in order to avoid false positives; I hope these
> are correct and sufficient.
>
> I'm not 100% happy with the interface (too much redundancy), but for
> the time being it should be OK.
>
> Of course, any improvement is welcome.
>
> Please consider including these changes in the next mdadm release,
> whichever that may be.

Thanks a lot!

I have applied the patch - with some formatting changes to make it
consistent with the rest of the code.

I don't really have time to look more deeply at it at the moment.
Maybe someone else will?...
Thanks, NeilBrown > bye, > > --- restripe.c.org 2011-01-13 05:52:15.000000000 +0100 > +++ restripe.c 2011-02-07 22:51:01.539471472 +0100 > @@ -285,10 +285,13 @@ > uint8_t raid6_gfexp[256]; > uint8_t raid6_gfinv[256]; > uint8_t raid6_gfexi[256]; > +uint8_t raid6_gflog[256]; > +uint8_t raid6_gfilog[256]; > void make_tables(void) > { > int i, j; > uint8_t v; > + uint32_t b, log; > > /* Compute multiplication table */ > for (i = 0; i < 256; i++) > @@ -312,6 +315,19 @@ > for (i = 0; i < 256; i ++) > raid6_gfexi[i] = raid6_gfinv[raid6_gfexp[i] ^ 1]; > > + /* Compute log and inverse log */ > + /* Modified code from: http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.html */ > + b = 1; > + raid6_gflog[0] = 0; > + raid6_gfilog[255] = 0; > + > + for (log = 0; log < 255; log++) { > + raid6_gflog[b] = (uint8_t) log; > + raid6_gfilog[log] = (uint8_t) b; > + b = b << 1; > + if (b & 256) b = b ^ 0435; > + } > + > tables_ready = 1; > } > > @@ -387,6 +403,65 @@ > } > } > > +/* Try to find out if a specific disk has a problem */ > +int raid6_check_disks(int data_disks, int start, int chunk_size, int level, int layout, int diskP, int diskQ, char *p, char *q, char **stripes) > +{ > + int i; > + int data_id, diskD; > + uint8_t Px, Qx; > + int curr_broken_disk = -1; > + int prev_broken_disk = -1; > + int broken_status = 0; > + > + for(i = 0; i < chunk_size; i++) { > + Px = (uint8_t)stripes[diskP][i] ^ (uint8_t)p[i]; > + Qx = (uint8_t)stripes[diskQ][i] ^ (uint8_t)q[i]; > + > + if((Px != 0) && (Qx == 0)) { > + curr_broken_disk = diskP; > + } > + > + if((Px == 0) && (Qx != 0)) { > + curr_broken_disk = diskQ; > + } > + > + if((Px != 0) && (Qx != 0)) { > + data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF; > + diskD = geo_map(data_id, start/chunk_size, data_disks + 2, level, layout); > + curr_broken_disk = diskD; > + } > + > + if((Px == 0) && (Qx == 0)) { > + curr_broken_disk = curr_broken_disk; > + } > + > + switch(broken_status) { > + case 0: > + if(curr_broken_disk != -1) { > + 
prev_broken_disk = curr_broken_disk; > + broken_status = 1; > + } > + break; > + > + case 1: > + if(curr_broken_disk != prev_broken_disk) { > + broken_status = 2; > + } > + if(curr_broken_disk >= data_disks + 2) { > + broken_status = 2; > + } > + break; > + > + case 2: > + default: > + curr_broken_disk = prev_broken_disk = -2; > + break; > + } > + } > + > + return curr_broken_disk; > +} > + > /* Save data: > * We are given: > * A list of 'fds' of the active disks. Some may be absent. > @@ -673,7 +748,12 @@ > char *q = malloc(chunk_size); > > int i; > + int diskP, diskQ; > int data_disks = raid_disks - (level == 5 ? 1: 2); > + > + if (!tables_ready) > + make_tables(); > + > for ( i = 0 ; i < raid_disks ; i++) > stripes[i] = stripe_buf + i * chunk_size; > > @@ -693,18 +773,25 @@ > switch(level) { > case 6: > qsyndrome(p, q, (uint8_t**)blocks, data_disks, chunk_size); > - disk = geo_map(-1, start/chunk_size, raid_disks, > + diskP = geo_map(-1, start/chunk_size, raid_disks, > level, layout); > - if (memcmp(p, stripes[disk], chunk_size) != 0) { > - printf("P(%d) wrong at %llu\n", disk, > + if (memcmp(p, stripes[diskP], chunk_size) != 0) { > + printf("P(%d) wrong at %llu\n", diskP, > start / chunk_size); > } > - disk = geo_map(-2, start/chunk_size, raid_disks, > + diskQ = geo_map(-2, start/chunk_size, raid_disks, > level, layout); > - if (memcmp(q, stripes[disk], chunk_size) != 0) { > - printf("Q(%d) wrong at %llu\n", disk, > + if (memcmp(q, stripes[diskQ], chunk_size) != 0) { > + printf("Q(%d) wrong at %llu\n", diskQ, > start / chunk_size); > } > + disk = raid6_check_disks(data_disks, start, chunk_size, level, layout, diskP, diskQ, p, q, stripes); > + if(disk >= 0) { > + printf("Possible failed disk: %d\n", disk); > + } > + if(disk == -2) { > + printf("Failure detected, but disk unknown\n"); > + } > break; > } > length -= chunk_size; > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Re: User space RAID-6 access 2011-02-07 22:49 ` NeilBrown @ 2011-02-09 18:47 ` Piergiorgio Sartor 2011-02-17 6:23 ` NeilBrown 0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-09 18:47 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> I have applied the patch - with some formatting changes to make it
> consistent with the rest of the code.
>
> I don't really have time to look more deeply at it at the moment.
> Maybe someone else will?...

Hi Neil,

thanks for including this in git.

Actually, I did look at it :-) and I already found a couple of issues;
below is a small fix patch.

One question about offsets.

I noticed that offsets seem to be defined per component in the md
software stack. I mean, looking at sysfs, the offsets are inside
md/dev-sdX/offset, or available with mdadm -E /dev/sdX.
It seems possible that different components have different offsets
(I can imagine this happening with different sizes).

Now, from inside "test_stripe" there is no information about the RAID
itself, I mean the device, nor about the metadata type.

What would be the best way to get the offsets inside "restripe.c"?

1) popen "mdadm -E ..." and grep the "Data Offset" field.
2) pass one single offset as a command line parameter.
3) add the whole md device as a command line parameter, removing the
   single devices, and use sysfs.
Thanks for any advice, bye, pg diff -uN a/restripe.c b/restripe.c --- a/restripe.c 2011-02-09 19:31:18.989495816 +0100 +++ b/restripe.c 2011-02-09 19:32:42.597955058 +0100 @@ -430,7 +430,8 @@ if((Px != 0) && (Qx != 0)) { - data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF; + data_id = (raid6_gflog[Qx] - raid6_gflog[Px]); + if(data_id < 0) data_id += 255; diskD = geo_map(data_id, start/chunk_size, data_disks + 2, level, layout); curr_broken_disk = diskD; @@ -439,6 +440,9 @@ if((Px == 0) && (Qx == 0)) curr_broken_disk = curr_broken_disk; + if(curr_broken_disk >= data_disks + 2) + broken_status = 2; + switch(broken_status) { case 0: if(curr_broken_disk != -1) { @@ -450,10 +454,6 @@ case 1: if(curr_broken_disk != prev_broken_disk) broken_status = 2; - - if(curr_broken_disk >= data_disks + 2) - broken_status = 2; - break; case 2: bye, -- piergiorgio ^ permalink raw reply [flat|nested] 14+ messages in thread
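The arithmetic behind raid6_check_disks(), including the mod-255 correction in the fix patch above, can be verified in isolation. The sketch below rebuilds the log/antilog tables exactly as the patch's make_tables() does, and shows why, when only one data disk is wrong, the differences Px (in P) and Qx (in Q) satisfy Qx = g^z * Px for data index z. It is a minimal standalone model, not the mdadm code itself.

```c
#include <stdint.h>

/* GF(2^8) log/antilog tables over the RAID-6 generator polynomial
 * x^8 + x^4 + x^3 + x^2 + 1 (0435 octal = 0x11d), as in restripe.c. */
static uint8_t gflog[256], gfilog[256];

static void make_log_tables(void)
{
    uint32_t b = 1, log;

    for (log = 0; log < 255; log++) {
        gflog[b] = (uint8_t)log;
        gfilog[log] = (uint8_t)b;
        b <<= 1;
        if (b & 256)
            b ^= 0435;
    }
}

/* Field multiplication via the log tables (exponents add mod 255). */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return gfilog[(gflog[a] + gflog[b]) % 255];
}

/* P xors all data bytes; Q xors g^i times the byte of data disk i.
 * If only data disk z is wrong by delta, then Px = delta and
 * Qx = g^z * delta, so z = log(Qx) - log(Px) taken mod 255 - which
 * is exactly why the "& 0xFF" in the first patch had to become an
 * "add 255 if negative" in the fix. */
static int locate_bad_data_disk(uint8_t Px, uint8_t Qx)
{
    if (Px == 0 || Qx == 0)
        return -1;  /* P-only, Q-only or no mismatch: not a data disk */
    return (gflog[Qx] - gflog[Px] + 255) % 255;
}
```

For example, corrupting data disk 2 by delta = 0x5a gives Px = 0x5a and Qx = gfmul(gfilog[2], 0x5a), and locate_bad_data_disk(Px, Qx) returns 2.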
* Re: [PATCH] Re: User space RAID-6 access 2011-02-09 18:47 ` Piergiorgio Sartor @ 2011-02-17 6:23 ` NeilBrown 2011-02-17 20:01 ` Piergiorgio Sartor 2011-02-18 23:02 ` Piergiorgio Sartor 0 siblings, 2 replies; 14+ messages in thread From: NeilBrown @ 2011-02-17 6:23 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid On Wed, 9 Feb 2011 19:47:34 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > > I have applied some patch - with some formatting changes to make it consistent > > with the rest of the code. > > > > I don't really have time to look more deeply at it at the moment. > > Maybe someone else will?... > > Hi Neil, > > thanks for including this in git. > > Actually I did it look at it :-) and I already found a > couple of issues, below is a small fix patch. Sorry for delay - I have applied this now. > > One question about offsets. > I noticed that offsets seems to be defined per component, > in the md software stack. > I mean, looking at sysfs, the offset are inside md/dev-sdX/offset > or available with mdadm -E /dev/sdX > It seems possible that different components have different > offset (I can imagine this happening with different sizes). > > Now, from inside "test_stripe" there is no information about > the RAID itself, I mean the device, nor about the metadata type. > > What would be the best way to get the offsets inside "restripe.c"? > > 1) popen "mdadm -E ..." and grep the "Data Offset" field. > 2) pass one single offset as command line parameter. > 3) add the whole md device as command line parameter, removing > the single devices, and use sysfs. As this is really just for testing, I would add a :offset suffix to the device names where the offset is in sectors. So: /dev/sda means read from the start of /dev/sda /dev/sda:2048 means start 1Meg into /dev/sda. That should be easy to parse and is quite general. 
If you wanted to create separate functionality that could be given an active md device and would read from the component devices, I would link with sysfs.c and use those routines to read info from sysfs and use that to guide the code which finds the data. Make sense? Thanks, NeilBrown > > Thanks for any advice, > > bye, > > pg > > > diff -uN a/restripe.c b/restripe.c > --- a/restripe.c 2011-02-09 19:31:18.989495816 +0100 > +++ b/restripe.c 2011-02-09 19:32:42.597955058 +0100 > @@ -430,7 +430,8 @@ > > > if((Px != 0) && (Qx != 0)) { > - data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF; > + data_id = (raid6_gflog[Qx] - raid6_gflog[Px]); > + if(data_id < 0) data_id += 255; > diskD = geo_map(data_id, start/chunk_size, > data_disks + 2, level, layout); > curr_broken_disk = diskD; > @@ -439,6 +440,9 @@ > if((Px == 0) && (Qx == 0)) > curr_broken_disk = curr_broken_disk; > > + if(curr_broken_disk >= data_disks + 2) > + broken_status = 2; > + > switch(broken_status) { > case 0: > if(curr_broken_disk != -1) { > @@ -450,10 +454,6 @@ > case 1: > if(curr_broken_disk != prev_broken_disk) > broken_status = 2; > - > - if(curr_broken_disk >= data_disks + 2) > - broken_status = 2; > - > break; > > case 2: > > bye, > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Re: User space RAID-6 access 2011-02-17 6:23 ` NeilBrown @ 2011-02-17 20:01 ` Piergiorgio Sartor 2011-02-18 23:02 ` Piergiorgio Sartor 1 sibling, 0 replies; 14+ messages in thread From: Piergiorgio Sartor @ 2011-02-17 20:01 UTC (permalink / raw) To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid On Thu, Feb 17, 2011 at 05:23:32PM +1100, NeilBrown wrote: > On Wed, 9 Feb 2011 19:47:34 +0100 Piergiorgio Sartor > <piergiorgio.sartor@nexgo.de> wrote: > > > > I have applied some patch - with some formatting changes to make it consistent > > > with the rest of the code. > > > > > > I don't really have time to look more deeply at it at the moment. > > > Maybe someone else will?... > > > > Hi Neil, > > > > thanks for including this in git. > > > > Actually I did it look at it :-) and I already found a > > couple of issues, below is a small fix patch. > > Sorry for delay - I have applied this now. No, problem, I know it could happen, just concerned about the quality of the patch... :-) > > One question about offsets. > > I noticed that offsets seems to be defined per component, > > in the md software stack. > > I mean, looking at sysfs, the offset are inside md/dev-sdX/offset > > or available with mdadm -E /dev/sdX > > It seems possible that different components have different > > offset (I can imagine this happening with different sizes). > > > > Now, from inside "test_stripe" there is no information about > > the RAID itself, I mean the device, nor about the metadata type. > > > > What would be the best way to get the offsets inside "restripe.c"? > > > > 1) popen "mdadm -E ..." and grep the "Data Offset" field. > > 2) pass one single offset as command line parameter. > > 3) add the whole md device as command line parameter, removing > > the single devices, and use sysfs. > > As this is really just for testing, I would add a :offset suffix to the > device names where the offset is in sectors. 
> So: > /dev/sda > means read from the start of /dev/sda > > /dev/sda:2048 > > means start 1Meg into /dev/sda. > > That should be easy to parse and is quite general. > > If you wanted to create separate functionality that could be given an active > md device and would read from the component devices, I would link with > sysfs.c and use those routines to read info from sysfs and use that to guide > the code which finds the data. > > Make sense? Thanks for the hint, I guess it makes a lot of sense. I'll try the ":", it seems to me reasonable for the time being. Eventually, I could consider to write a separate piece of code linking restripe.c and sysfs.c together. Thanks again, bye, pg > > bye, > > > > pg > > > > > > diff -uN a/restripe.c b/restripe.c > > --- a/restripe.c 2011-02-09 19:31:18.989495816 +0100 > > +++ b/restripe.c 2011-02-09 19:32:42.597955058 +0100 > > @@ -430,7 +430,8 @@ > > > > > > if((Px != 0) && (Qx != 0)) { > > - data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF; > > + data_id = (raid6_gflog[Qx] - raid6_gflog[Px]); > > + if(data_id < 0) data_id += 255; > > diskD = geo_map(data_id, start/chunk_size, > > data_disks + 2, level, layout); > > curr_broken_disk = diskD; > > @@ -439,6 +440,9 @@ > > if((Px == 0) && (Qx == 0)) > > curr_broken_disk = curr_broken_disk; > > > > + if(curr_broken_disk >= data_disks + 2) > > + broken_status = 2; > > + > > switch(broken_status) { > > case 0: > > if(curr_broken_disk != -1) { > > @@ -450,10 +454,6 @@ > > case 1: > > if(curr_broken_disk != prev_broken_disk) > > broken_status = 2; > > - > > - if(curr_broken_disk >= data_disks + 2) > > - broken_status = 2; > > - > > break; > > > > case 2: > > > > bye, > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- piergiorgio ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] Re: User space RAID-6 access
  2011-02-17  6:23             ` NeilBrown
  2011-02-17 20:01               ` Piergiorgio Sartor
@ 2011-02-18 23:02               ` Piergiorgio Sartor
  1 sibling, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-18 23:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> As this is really just for testing, I would add a :offset suffix to the
> device names where the offset is in sectors.
> So:
>    /dev/sda
> means read from the start of /dev/sda
>
>    /dev/sda:2048
>
> means start 1Meg into /dev/sda.
>
> That should be easy to parse and is quite general.

Hi again,

please find attached below another mini-patch which
should add the offset option, as discussed.

That is, "/dev/sdX:offset".

It should work with or without the ":offset" tail, and
possibly also with "/dev/sdX:", depending on "atoll()".

Please consider for inclusion.

Thanks,

bye,

diff -uN a/restripe.c b/restripe.c
--- a/restripe.c	2011-02-18 23:18:20.377740868 +0100
+++ b/restripe.c	2011-02-18 23:30:07.589841525 +0100
@@ -875,6 +875,14 @@
 			exit(3);
 		}
 		for (i=0; i<raid_disks; i++) {
+			char *p;
+			p = strchr(argv[9+i], ':');
+
+			if(p != NULL) {
+				*p++ = '\0';
+				offsets[i] = atoll(p) * 512;
+			}
+
 			fds[i] = open(argv[9+i], O_RDWR);
 			if (fds[i] < 0) {
 				perror(argv[9+i]);

--

piergiorgio

^ permalink raw reply	[flat|nested] 14+ messages in thread
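[Editor's note: the sketch below is a standalone restatement of the ":offset" parsing added by the patch above, pulled out as a hypothetical helper (parse_dev_offset is not a name in restripe.c). It also illustrates the "/dev/sdX:" caveat: glibc's atoll("") returns 0, so an empty suffix degenerates to offset 0, but that behavior is not guaranteed by the C standard.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split "name[:sectors]" in place: terminate the device name at the
 * colon, if any, and return the byte offset.  Offsets are given in
 * 512-byte sectors, matching the convention suggested in the thread. */
static long long parse_dev_offset(char *spec)
{
	char *p = strchr(spec, ':');

	if (p == NULL)
		return 0;		/* no suffix: start of device */
	*p++ = '\0';			/* spec now holds just the name */
	return atoll(p) * 512;		/* sectors -> bytes */
}
```

So "/dev/sda:2048" splits into the name "/dev/sda" and an offset of 2048 * 512 = 1048576 bytes, i.e. the "1Meg into /dev/sda" from Neil's example.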
end of thread, other threads:[~2011-02-18 23:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-31 20:20 User space RAID-6 access Piergiorgio Sartor
2011-01-31 20:52 ` NeilBrown
2011-02-01 19:21   ` Piergiorgio Sartor
2011-02-01 20:14     ` John Robinson
2011-02-01 20:18     ` NeilBrown
2011-02-01 21:00       ` Piergiorgio Sartor
2011-02-05 17:33   ` Piergiorgio Sartor
2011-02-05 20:58     ` NeilBrown
2011-02-07 22:24       ` [PATCH] " Piergiorgio Sartor
2011-02-07 22:49         ` NeilBrown
2011-02-09 18:47           ` Piergiorgio Sartor
2011-02-17  6:23             ` NeilBrown
2011-02-17 20:01               ` Piergiorgio Sartor
2011-02-18 23:02               ` Piergiorgio Sartor