* User space RAID-6 access
@ 2011-01-31 20:20 Piergiorgio Sartor
  2011-01-31 20:52 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-01-31 20:20 UTC (permalink / raw)
  To: linux-raid

Hi all,

Some time ago, I think it was Neil who mentioned the
possibility of consistently accessing a RAID-6 array
from user space, in order to perform some checks, like
the notorious "which HDD has the wrong data".

Is there any reference, documentation or source code
which could be taken as an example for such a case?

Thanks,

bye,

-- 

piergiorgio


* Re: User space RAID-6 access
  2011-01-31 20:20 User space RAID-6 access Piergiorgio Sartor
@ 2011-01-31 20:52 ` NeilBrown
  2011-02-01 19:21   ` Piergiorgio Sartor
  2011-02-05 17:33   ` Piergiorgio Sartor
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2011-01-31 20:52 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> Hi all,
> 
> Some time ago, I think it was Neil who mentioned the
> possibility of consistently accessing a RAID-6 array
> from user space, in order to perform some checks, like
> the notorious "which HDD has the wrong data".
> 
> Is there any reference, documentation or source code
> which could be taken as an example for such a case?
> 

Look in the mdadm source code, particularly at restripe.c

Also
   make test_stripe

builds a program that the test suite uses to verify data correctness.

That should give you enough hints to get you started.

NeilBrown



* Re: User space RAID-6 access
  2011-01-31 20:52 ` NeilBrown
@ 2011-02-01 19:21   ` Piergiorgio Sartor
  2011-02-01 20:14     ` John Robinson
  2011-02-01 20:18     ` NeilBrown
  2011-02-05 17:33   ` Piergiorgio Sartor
  1 sibling, 2 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-01 19:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> 
> > Hi all,
> > 
> > Some time ago, I think it was Neil who mentioned the
> > possibility of consistently accessing a RAID-6 array
> > from user space, in order to perform some checks, like
> > the notorious "which HDD has the wrong data".
> > 
> > Is there any reference, documentation or source code
> > which could be taken as an example for such a case?
> > 
> 
> Look in the mdadm source code, particularly at restripe.c
> 
> Also
>    make test_stripe
> 
> builds a program that the test suite uses to verify data correctness.
> 
> That should give you enough hints to get you started.


Hi Neil, thanks for the pointer.

I had a look at the code and there is something I did not get.

It seems to me the HDDs composing the array are "simply"
opened with "open".

In case the array is in use, how is a race condition
avoided between the "test_stripe" program and the md device?

I mean, the former could start reading from the HDDs while
the latter writes something to the same place, leading to
parity that looks inconsistent from the "test_stripe" point
of view, since it missed an update.

Or is there locking inside md? (I could not see any in
"test_stripe", and it would also be dangerous.)

Thanks again,

bye,

-- 

piergiorgio


* Re: User space RAID-6 access
  2011-02-01 19:21   ` Piergiorgio Sartor
@ 2011-02-01 20:14     ` John Robinson
  2011-02-01 20:18     ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: John Robinson @ 2011-02-01 20:14 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: NeilBrown, linux-raid

On 01/02/2011 19:21, Piergiorgio Sartor wrote:
> On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote:
>> On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
>> <piergiorgio.sartor@nexgo.de>  wrote:
>>
>>> Hi all,
>>>
>>> Some time ago, I think it was Neil who mentioned the
>>> possibility of consistently accessing a RAID-6 array
>>> from user space, in order to perform some checks, like
>>> the notorious "which HDD has the wrong data".
>>>
>>> Is there any reference, documentation or source code
>>> which could be taken as an example for such a case?
>>>
>>
>> Look in the mdadm source code, particularly at restripe.c
>>
>> Also
>>     make test_stripe
>>
>> builds a program that the test suite uses to verify data correctness.
>>
>> That should give you enough hints to get you started.
>
>
> Hi Neil, thanks for the pointer.
>
> I had a look at the code and there is something I did not get.
>
> It seems to me the HDDs composing the array are "simply"
> opened with "open".
>
> In case the array is in use, how is a race condition
> avoided between the "test_stripe" program and the md device?
>
> I mean, the former could start reading from the HDDs while
> the latter writes something to the same place, leading to
> parity that looks inconsistent from the "test_stripe" point
> of view, since it missed an update.
>
> Or is there locking inside md? (I could not see any in
> "test_stripe", and it would also be dangerous.)

Look at Grow.c and suspend_lo and suspend_hi.

Just my €$£0.005, I may be wrong, etc.

Cheers,

John.



* Re: User space RAID-6 access
  2011-02-01 19:21   ` Piergiorgio Sartor
  2011-02-01 20:14     ` John Robinson
@ 2011-02-01 20:18     ` NeilBrown
  2011-02-01 21:00       ` Piergiorgio Sartor
  1 sibling, 1 reply; 14+ messages in thread
From: NeilBrown @ 2011-02-01 20:18 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Tue, 1 Feb 2011 20:21:54 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> On Tue, Feb 01, 2011 at 07:52:59AM +1100, NeilBrown wrote:
> > On Mon, 31 Jan 2011 21:20:55 +0100 Piergiorgio Sartor
> > <piergiorgio.sartor@nexgo.de> wrote:
> > 
> > > Hi all,
> > > 
> > > Some time ago, I think it was Neil who mentioned the
> > > possibility of consistently accessing a RAID-6 array
> > > from user space, in order to perform some checks, like
> > > the notorious "which HDD has the wrong data".
> > > 
> > > Is there any reference, documentation or source code
> > > which could be taken as an example for such a case?
> > > 
> > 
> > Look in the mdadm source code, particularly at restripe.c
> > 
> > Also
> >    make test_stripe
> > 
> > builds a program that the test suite uses to verify data correctness.
> > 
> > That should give you enough hints to get you started.
> 
> 
> Hi Neil, thanks for the pointer.
> 
> I had a look at the code and there is something I did not get.
> 
> It seems to me the HDDs composing the array are "simply"
> opened with "open".
> 
> In case the array is in use, how is a race condition
> avoided between the "test_stripe" program and the md device?
> 
> I mean, the former could start reading from the HDDs while
> the latter writes something to the same place, leading to
> parity that looks inconsistent from the "test_stripe" point
> of view, since it missed an update.
> 
> Or is there locking inside md? (I could not see any in
> "test_stripe", and it would also be dangerous.)
> 
> Thanks again,
> 
> bye,
> 

I didn't realise that you wanted to look at the members of the array while
the array was active!!  That is a bit harder.  But not impossible.

If you write a couple of numbers to 'suspend_lo' and 'suspend_hi' in sysfs,
then writes to blocks between those two array addresses will be blocked.
So you could suspend a region, look at the blocks, then un-suspend.
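
An untested sketch of the idea in C, for an array /dev/md0 with 64K
chunks (I believe the two attributes take array addresses in sectors -
check Documentation/md.txt):

#include <stdio.h>
#include <stdlib.h>

static void md_attr_set(const char *attr, unsigned long long sectors)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
	f = fopen(path, "w");
	if (f == NULL) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%llu\n", sectors);
	fclose(f);
}

int main(void)
{
	/* suspend writes to the first 3 chunks (3 * 64K = 384 sectors) */
	md_attr_set("suspend_lo", 0);
	md_attr_set("suspend_hi", 384);

	/* ... read and check the component devices for that region ... */

	/* un-suspend */
	md_attr_set("suspend_hi", 0);
	md_attr_set("suspend_lo", 0);
	return 0;
}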

NeilBrown


* Re: User space RAID-6 access
  2011-02-01 20:18     ` NeilBrown
@ 2011-02-01 21:00       ` Piergiorgio Sartor
  0 siblings, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-01 21:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> I didn't realise that you wanted to look at the members of the array while
> the array was active!!  That is a bit harder.  But not impossible.

Well, let's say that would be optimal, but even
off-line could be OK.

BTW, what exactly does "off-line" mean?
Does it mean that /dev/mdX exists, but is not in
use (unmounted), or that it is just read-only, or
that it should *not* exist and only the component
HDDs are available?
 
> If you write a couple of numbers to 'suspend_lo' and 'suspend_hi' in sysfs,
> then writes to blocks between those two array addresses will be blocked.
> So you could suspend a region, look at the blocks, then un-suspend.

Oh, thanks, I will consider the option.

bye,

-- 

piergiorgio


* Re: User space RAID-6 access
  2011-01-31 20:52 ` NeilBrown
  2011-02-01 19:21   ` Piergiorgio Sartor
@ 2011-02-05 17:33   ` Piergiorgio Sartor
  2011-02-05 20:58     ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-05 17:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> Look in the mdadm source code, particularly at restripe.c
> 
> Also
>    make test_stripe
> 
> builds a program that the test suite uses to verify data correctness.
> 
> That should give you enough hints to get you started.

Hi Neil,

I'm trying to use the "test_stripe" binary, just to
confirm the operation, in test mode (neither save nor restore).

Unfortunately, it does not seem to work properly in my hands.

I created a 4-disk RAID-6 (/dev/loop[0-3]) with:

mdadm -C /dev/md111 -l 6 -n 4 --chunk=64 /dev/loop[0-3]

Filled the array from urandom:

dd if=/dev/urandom of=/dev/md111

And tried:

./test_stripe test file_test.raw 4 65536 6 2 65536 $[ 65536 * 3 ] /dev/loop[0-3]
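
(If I read main() in restripe.c correctly, the arguments are: mode,
dump file, raid_disks, chunk_size, level, layout, start and length
in bytes, and then the component devices.)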

This returns:

0->0
1->1
P(2) wrong at 1
Q(3) wrong at 1
0->3
1->0
P(1) wrong at 2
Q(2) wrong at 2
0->2
1->3
P(0) wrong at 3
Q(1) wrong at 3

The array, when filled with zeros, does not return any error.

Am I missing something, or does the code have problems?

Another question: I noticed the code uses an array of
"offsets" which seems to be filled with "0" and never
changed.
Is this really intended?
Is the offset information the one found per component
using "mdadm -E ..." or /sys/class/block/mdX/md/rdX/offset?
What is the relation with the "start" parameter of "test_stripe"?

Thanks in advance for any hints or suggestions.

The md device is:

/dev/md111:
        Version : 1.2
  Creation Time : Sat Feb  5 17:48:49 2011
     Raid Level : raid6
     Array Size : 524160 (511.96 MiB 536.74 MB)
  Used Dev Size : 262080 (255.98 MiB 268.37 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat Feb  5 18:19:45 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : lazy.lzy:111  (local to host lazy.lzy)
           UUID : b794fb80:c4006853:47761570:3d97a1d2
         Events : 42

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3

bye,

-- 

piergiorgio


* Re: User space RAID-6 access
  2011-02-05 17:33   ` Piergiorgio Sartor
@ 2011-02-05 20:58     ` NeilBrown
  2011-02-07 22:24       ` [PATCH] " Piergiorgio Sartor
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2011-02-05 20:58 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Sat, 5 Feb 2011 18:33:34 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> > Look in the mdadm source code, particularly at restripe.c
> > 
> > Also
> >    make test_stripe
> > 
> > builds a program that the test suite uses to verify data correctness.
> > 
> > That should give you enough hints to get you started.
> 
> Hi Neil,
> 
> I'm trying to use the "test_stripe" binary, just to
> confirm the operation, in test mode (neither save nor restore).
> 
> Unfortunately, it does not seem to work properly in my hands.
> 
> I created a 4-disk RAID-6 (/dev/loop[0-3]) with:
> 
> mdadm -C /dev/md111 -l 6 -n 4 --chunk=64 /dev/loop[0-3]
> 
> Filled the array from urandom:
> 
> dd if=/dev/urandom of=/dev/md111
> 
> And tried:
> 
> ./test_stripe test file_test.raw 4 65536 6 2 65536 $[ 65536 * 3 ] /dev/loop[0-3]
> 
> This returns:
> 
> 0->0
> 1->1
> P(2) wrong at 1
> Q(3) wrong at 1
> 0->3
> 1->0
> P(1) wrong at 2
> Q(2) wrong at 2
> 0->2
> 1->3
> P(0) wrong at 3
> Q(1) wrong at 3
> 
> The array, when filled with zeros, does not return any error.
> 
> Am I missing something, or does the code have problems?
> 
> Another question: I noticed the code uses an array of
> "offsets" which seems to be filled with "0" and never
> changed.
> Is this really intended?
> Is the offset information the one found per component
> using "mdadm -E ..." or /sys/class/block/mdX/md/rdX/offset?
> What is the relation with the "start" parameter of "test_stripe"?
> 
> Thanks in advance for any hints or suggestions.

test_stripe assumes that the data starts at the start of each device.
As you are using 1.2 metadata (the default), data starts about 1M into
the device (I think - you can check with --examine).

You could fix test_stripe to put the right value in the 'offsets' array,
or you could create the array with 1.0 or 0.90 metadata.
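
For example (illustrative numbers - check your actual --examine
output), "mdadm -E /dev/loop0" reports a "Data Offset" line, typically
2048 sectors for 1.2 metadata, and the matching entry in the 'offsets'
array would then be 2048 * 512 bytes.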

NeilBrown



> 
> The md device is:
> 
> /dev/md111:
>         Version : 1.2
>   Creation Time : Sat Feb  5 17:48:49 2011
>      Raid Level : raid6
>      Array Size : 524160 (511.96 MiB 536.74 MB)
>   Used Dev Size : 262080 (255.98 MiB 268.37 MB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
> 
>     Update Time : Sat Feb  5 18:19:45 2011
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>            Name : lazy.lzy:111  (local to host lazy.lzy)
>            UUID : b794fb80:c4006853:47761570:3d97a1d2
>          Events : 42
> 
>     Number   Major   Minor   RaidDevice State
>        0       7        0        0      active sync   /dev/loop0
>        1       7        1        1      active sync   /dev/loop1
>        2       7        2        2      active sync   /dev/loop2
>        3       7        3        3      active sync   /dev/loop3
> 
> bye,
> 



* [PATCH] Re: User space RAID-6 access
  2011-02-05 20:58     ` NeilBrown
@ 2011-02-07 22:24       ` Piergiorgio Sartor
  2011-02-07 22:49         ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-07 22:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> test_stripe assumes that the data starts at the start of each device.
> As you are using 1.2 metadata (the default), data starts about 1M into
> the device (I think - you can check with --examine).
> 
> You could fix test_stripe to put the right value in the 'offsets' array,
> or you could create the array with 1.0 or 0.90 metadata.

Hi Neil,

thanks for the info; maybe that should be a second patch.

In the meantime, please find attached a patch to restripe.c
of mdadm 3.2 (latest, I hope).

This should add the functionality to detect, in RAID-6,
which of the disks potentially has problems, in case of
parity errors.
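
For reference, the relation used is: if data disk "d" holds a byte
that is wrong by "e", then the recomputed P differs from the stored
one by Px = e, while Q differs by Qx = g^d * e in GF(256), so
d = log(Qx) - log(Px) taken modulo 255.  A small self-contained
sketch of just this byte-level check (an illustration only, not part
of the patch below):

#include <stdio.h>
#include <stdint.h>

static uint8_t gflog[256], gfilog[256];

static void make_log_tables(void)
{
	uint32_t b = 1, log;

	for (log = 0; log < 255; log++) {
		gflog[b] = log;
		gfilog[log] = b;
		b <<= 1;
		if (b & 0x100)
			b ^= 0x11d;	/* same polynomial as 0435 octal */
	}
}

static uint8_t gfmul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return 0;
	return gfilog[(gflog[a] + gflog[b]) % 255];
}

int main(void)
{
	int d = 2;		/* index of the corrupted data disk */
	uint8_t e = 0x5a;	/* the error byte */
	uint8_t Px, Qx;
	int data_id;

	make_log_tables();

	Px = e;				/* P xor P' = e       */
	Qx = gfmul(gfilog[d], e);	/* Q xor Q' = g^d * e */

	data_id = gflog[Qx] - gflog[Px];
	if (data_id < 0)
		data_id += 255;

	printf("corrupted data disk: %d (expected %d)\n", data_id, d);
	return 0;
}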

Some checks take place in order to avoid false positives;
I hope these are correct and sufficient.

I'm not 100% happy with the interface (too much redundancy),
but for the time being it should be OK.

Of course, any improvement is welcome.

Please consider including these changes in the next mdadm
release.

bye,

--- restripe.c.org	2011-01-13 05:52:15.000000000 +0100
+++ restripe.c	2011-02-07 22:51:01.539471472 +0100
@@ -285,10 +285,13 @@
 uint8_t raid6_gfexp[256];
 uint8_t raid6_gfinv[256];
 uint8_t raid6_gfexi[256];
+uint8_t raid6_gflog[256];
+uint8_t raid6_gfilog[256];
 void make_tables(void)
 {
 	int i, j;
 	uint8_t v;
+	uint32_t b, log;
 
 	/* Compute multiplication table */
 	for (i = 0; i < 256; i++)
@@ -312,6 +315,19 @@
 	for (i = 0; i < 256; i ++)
 		raid6_gfexi[i] = raid6_gfinv[raid6_gfexp[i] ^ 1];
 
+	/* Compute log and inverse log */
+	/* Modified code from: http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.html */
+	b = 1;
+	raid6_gflog[0] = 0;
+	raid6_gfilog[255] = 0;
+
+	for (log = 0; log < 255; log++) {
+	  raid6_gflog[b] = (uint8_t) log;
+	  raid6_gfilog[log] = (uint8_t) b;
+	  b = b << 1;
+	  if (b & 256) b = b ^ 0435;
+	}
+
 	tables_ready = 1;
 }
 
@@ -387,6 +403,65 @@
 	}
 }
 
+/* Try to find out if a specific disk has a problem */
+int raid6_check_disks(int data_disks, int start, int chunk_size, int level, int layout, int diskP, int diskQ, char *p, char *q, char **stripes)
+{
+  int i;
+  int data_id, diskD;
+  uint8_t Px, Qx;
+  int curr_broken_disk = -1;
+  int prev_broken_disk = -1;
+  int broken_status = 0;
+
+  for(i = 0; i < chunk_size; i++) {
+    Px = (uint8_t)stripes[diskP][i] ^ (uint8_t)p[i];
+    Qx = (uint8_t)stripes[diskQ][i] ^ (uint8_t)q[i];
+
+    if((Px != 0) && (Qx == 0)) {
+      curr_broken_disk = diskP;
+    }
+
+    if((Px == 0) && (Qx != 0)) {
+      curr_broken_disk = diskQ;
+    }
+
+    if((Px != 0) && (Qx != 0)) {
+      data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF;
+      diskD = geo_map(data_id, start/chunk_size, data_disks + 2, level, layout);
+      curr_broken_disk = diskD;
+    }
+
+    if((Px == 0) && (Qx == 0)) {
+      curr_broken_disk = curr_broken_disk;
+    }
+
+    switch(broken_status) {
+    case 0:
+      if(curr_broken_disk != -1) {
+	prev_broken_disk = curr_broken_disk;
+	broken_status = 1;
+      }
+      break;
+
+    case 1:
+      if(curr_broken_disk != prev_broken_disk) {
+	broken_status = 2;
+      }
+      if(curr_broken_disk >= data_disks + 2) {
+	broken_status = 2;
+      }
+      break;
+
+    case 2:
+    default:
+      curr_broken_disk = prev_broken_disk = -2;
+      break;
+    }
+  }
+
+  return curr_broken_disk;
+}
+
 /* Save data:
  * We are given:
  *  A list of 'fds' of the active disks.  Some may be absent.
@@ -673,7 +748,12 @@
 	char *q = malloc(chunk_size);
 
 	int i;
+	int diskP, diskQ;
 	int data_disks = raid_disks - (level == 5 ? 1: 2);
+
+	if (!tables_ready)
+		make_tables();
+
 	for ( i = 0 ; i < raid_disks ; i++)
 		stripes[i] = stripe_buf + i * chunk_size;
 
@@ -693,18 +773,25 @@
 		switch(level) {
 		case 6:
 			qsyndrome(p, q, (uint8_t**)blocks, data_disks, chunk_size);
-			disk = geo_map(-1, start/chunk_size, raid_disks,
+			diskP = geo_map(-1, start/chunk_size, raid_disks,
 				       level, layout);
-			if (memcmp(p, stripes[disk], chunk_size) != 0) {
-				printf("P(%d) wrong at %llu\n", disk,
+			if (memcmp(p, stripes[diskP], chunk_size) != 0) {
+				printf("P(%d) wrong at %llu\n", diskP,
 				       start / chunk_size);
 			}
-			disk = geo_map(-2, start/chunk_size, raid_disks,
+			diskQ = geo_map(-2, start/chunk_size, raid_disks,
 				       level, layout);
-			if (memcmp(q, stripes[disk], chunk_size) != 0) {
-				printf("Q(%d) wrong at %llu\n", disk,
+			if (memcmp(q, stripes[diskQ], chunk_size) != 0) {
+				printf("Q(%d) wrong at %llu\n", diskQ,
 				       start / chunk_size);
 			}
+			disk = raid6_check_disks(data_disks, start, chunk_size, level, layout, diskP, diskQ, p, q, stripes);
+			if(disk >= 0) {
+			  printf("Possible failed disk: %d\n", disk);
+			}
+			if(disk == -2) {
+			  printf("Failure detected, but disk unknown\n");
+			}
 			break;
 		}
 		length -= chunk_size;

-- 

piergiorgio


* Re: [PATCH] Re: User space RAID-6 access
  2011-02-07 22:24       ` [PATCH] " Piergiorgio Sartor
@ 2011-02-07 22:49         ` NeilBrown
  2011-02-09 18:47           ` Piergiorgio Sartor
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2011-02-07 22:49 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, 7 Feb 2011 23:24:59 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> > test_stripe assumes that the data starts at the start of each device.
> > As you are using 1.2 metadata (the default), data starts about 1M into
> > the device (I think - you can check with --examine).
> > 
> > You could fix test_stripe to put the right value in the 'offsets' array,
> > or you could create the array with 1.0 or 0.90 metadata.
> 
> Hi Neil,
> 
> thanks for the info; maybe that should be a second patch.
> 
> In the meantime, please find attached a patch to restripe.c
> of mdadm 3.2 (latest, I hope).
> 
> This should add the functionality to detect, in RAID-6,
> which of the disks potentially has problems, in case of
> parity errors.
> 
> Some checks take place in order to avoid false positives;
> I hope these are correct and sufficient.
> 
> I'm not 100% happy with the interface (too much redundancy),
> but for the time being it should be OK.
> 
> Of course, any improvement is welcome.
> 
> Please consider including these changes in the next mdadm
> release.
> 

Thanks a lot!

I have applied the patch - with some formatting changes to make it consistent
with the rest of the code.

I don't really have time to look more deeply at it at the moment.
Maybe someone else will?...

Thanks,
NeilBrown


> bye,
> 
> --- restripe.c.org	2011-01-13 05:52:15.000000000 +0100
> +++ restripe.c	2011-02-07 22:51:01.539471472 +0100
> @@ -285,10 +285,13 @@
>  uint8_t raid6_gfexp[256];
>  uint8_t raid6_gfinv[256];
>  uint8_t raid6_gfexi[256];
> +uint8_t raid6_gflog[256];
> +uint8_t raid6_gfilog[256];
>  void make_tables(void)
>  {
>  	int i, j;
>  	uint8_t v;
> +	uint32_t b, log;
>  
>  	/* Compute multiplication table */
>  	for (i = 0; i < 256; i++)
> @@ -312,6 +315,19 @@
>  	for (i = 0; i < 256; i ++)
>  		raid6_gfexi[i] = raid6_gfinv[raid6_gfexp[i] ^ 1];
>  
> +	/* Compute log and inverse log */
> +	/* Modified code from: http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.html */
> +	b = 1;
> +	raid6_gflog[0] = 0;
> +	raid6_gfilog[255] = 0;
> +
> +	for (log = 0; log < 255; log++) {
> +	  raid6_gflog[b] = (uint8_t) log;
> +	  raid6_gfilog[log] = (uint8_t) b;
> +	  b = b << 1;
> +	  if (b & 256) b = b ^ 0435;
> +	}
> +
>  	tables_ready = 1;
>  }
>  
> @@ -387,6 +403,65 @@
>  	}
>  }
>  
> +/* Try to find out if a specific disk has a problem */
> +int raid6_check_disks(int data_disks, int start, int chunk_size, int level, int layout, int diskP, int diskQ, char *p, char *q, char **stripes)
> +{
> +  int i;
> +  int data_id, diskD;
> +  uint8_t Px, Qx;
> +  int curr_broken_disk = -1;
> +  int prev_broken_disk = -1;
> +  int broken_status = 0;
> +
> +  for(i = 0; i < chunk_size; i++) {
> +    Px = (uint8_t)stripes[diskP][i] ^ (uint8_t)p[i];
> +    Qx = (uint8_t)stripes[diskQ][i] ^ (uint8_t)q[i];
> +
> +    if((Px != 0) && (Qx == 0)) {
> +      curr_broken_disk = diskP;
> +    }
> +
> +    if((Px == 0) && (Qx != 0)) {
> +      curr_broken_disk = diskQ;
> +    }
> +
> +    if((Px != 0) && (Qx != 0)) {
> +      data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF;
> +      diskD = geo_map(data_id, start/chunk_size, data_disks + 2, level, layout);
> +      curr_broken_disk = diskD;
> +    }
> +
> +    if((Px == 0) && (Qx == 0)) {
> +      curr_broken_disk = curr_broken_disk;
> +    }
> +
> +    switch(broken_status) {
> +    case 0:
> +      if(curr_broken_disk != -1) {
> +	prev_broken_disk = curr_broken_disk;
> +	broken_status = 1;
> +      }
> +      break;
> +
> +    case 1:
> +      if(curr_broken_disk != prev_broken_disk) {
> +	broken_status = 2;
> +      }
> +      if(curr_broken_disk >= data_disks + 2) {
> +	broken_status = 2;
> +      }
> +      break;
> +
> +    case 2:
> +    default:
> +      curr_broken_disk = prev_broken_disk = -2;
> +      break;
> +    }
> +  }
> +
> +  return curr_broken_disk;
> +}
> +
>  /* Save data:
>   * We are given:
>   *  A list of 'fds' of the active disks.  Some may be absent.
> @@ -673,7 +748,12 @@
>  	char *q = malloc(chunk_size);
>  
>  	int i;
> +	int diskP, diskQ;
>  	int data_disks = raid_disks - (level == 5 ? 1: 2);
> +
> +	if (!tables_ready)
> +		make_tables();
> +
>  	for ( i = 0 ; i < raid_disks ; i++)
>  		stripes[i] = stripe_buf + i * chunk_size;
>  
> @@ -693,18 +773,25 @@
>  		switch(level) {
>  		case 6:
>  			qsyndrome(p, q, (uint8_t**)blocks, data_disks, chunk_size);
> -			disk = geo_map(-1, start/chunk_size, raid_disks,
> +			diskP = geo_map(-1, start/chunk_size, raid_disks,
>  				       level, layout);
> -			if (memcmp(p, stripes[disk], chunk_size) != 0) {
> -				printf("P(%d) wrong at %llu\n", disk,
> +			if (memcmp(p, stripes[diskP], chunk_size) != 0) {
> +				printf("P(%d) wrong at %llu\n", diskP,
>  				       start / chunk_size);
>  			}
> -			disk = geo_map(-2, start/chunk_size, raid_disks,
> +			diskQ = geo_map(-2, start/chunk_size, raid_disks,
>  				       level, layout);
> -			if (memcmp(q, stripes[disk], chunk_size) != 0) {
> -				printf("Q(%d) wrong at %llu\n", disk,
> +			if (memcmp(q, stripes[diskQ], chunk_size) != 0) {
> +				printf("Q(%d) wrong at %llu\n", diskQ,
>  				       start / chunk_size);
>  			}
> +			disk = raid6_check_disks(data_disks, start, chunk_size, level, layout, diskP, diskQ, p, q, stripes);
> +			if(disk >= 0) {
> +			  printf("Possible failed disk: %d\n", disk);
> +			}
> +			if(disk == -2) {
> +			  printf("Failure detected, but disk unknown\n");
> +			}
>  			break;
>  		}
>  		length -= chunk_size;
> 



* Re: [PATCH] Re: User space RAID-6 access
  2011-02-07 22:49         ` NeilBrown
@ 2011-02-09 18:47           ` Piergiorgio Sartor
  2011-02-17  6:23             ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-09 18:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> I have applied the patch - with some formatting changes to make it consistent
> with the rest of the code.
> 
> I don't really have time to look more deeply at it at the moment.
> Maybe someone else will?...

Hi Neil,

thanks for including this in git.

Actually I did look at it :-) and I already found a
couple of issues; below is a small fix patch.

One question about offsets.
I noticed that offsets seem to be defined per component
in the md software stack.
I mean, looking at sysfs, the offsets are inside md/dev-sdX/offset,
or available with mdadm -E /dev/sdX.
It seems possible that different components have different
offsets (I can imagine this happening with different sizes).

Now, from inside "test_stripe" there is no information about
the RAID itself, I mean neither the device nor the metadata type.

What would be the best way to get the offsets inside "restripe.c"?

1) popen "mdadm -E ..." and grep the "Data Offset" field.
2) pass one single offset as a command line parameter.
3) add the whole md device as a command line parameter, removing
the single devices, and use sysfs.

Thanks for any advice,

bye,

pg


diff -uN a/restripe.c b/restripe.c
--- a/restripe.c	2011-02-09 19:31:18.989495816 +0100
+++ b/restripe.c	2011-02-09 19:32:42.597955058 +0100
@@ -430,7 +430,8 @@
 
 
 		if((Px != 0) && (Qx != 0)) {
-			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF;
+			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]);
+			if(data_id < 0) data_id += 255;
 			diskD = geo_map(data_id, start/chunk_size,
 					data_disks + 2, level, layout);
 			curr_broken_disk = diskD;
@@ -439,6 +440,9 @@
 		if((Px == 0) && (Qx == 0))
 			curr_broken_disk = curr_broken_disk;
 
+		if(curr_broken_disk >= data_disks + 2)
+			broken_status = 2;
+
 		switch(broken_status) {
 		case 0:
 			if(curr_broken_disk != -1) {
@@ -450,10 +454,6 @@
 		case 1:
 			if(curr_broken_disk != prev_broken_disk)
 				broken_status = 2;
-
-			if(curr_broken_disk >= data_disks + 2)
-				broken_status = 2;
-
 			break;
 
 		case 2:
 
bye,

-- 

piergiorgio


* Re: [PATCH] Re: User space RAID-6 access
  2011-02-09 18:47           ` Piergiorgio Sartor
@ 2011-02-17  6:23             ` NeilBrown
  2011-02-17 20:01               ` Piergiorgio Sartor
  2011-02-18 23:02               ` Piergiorgio Sartor
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2011-02-17  6:23 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Wed, 9 Feb 2011 19:47:34 +0100 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:

> > I have applied the patch - with some formatting changes to make it consistent
> > with the rest of the code.
> > 
> > I don't really have time to look more deeply at it at the moment.
> > Maybe someone else will?...
> 
> Hi Neil,
> 
> thanks for including this in git.
> 
> Actually I did look at it :-) and I already found a
> couple of issues; below is a small fix patch.

Sorry for the delay - I have applied this now.


> 
> One question about offsets.
> I noticed that offsets seem to be defined per component
> in the md software stack.
> I mean, looking at sysfs, the offsets are inside md/dev-sdX/offset,
> or available with mdadm -E /dev/sdX.
> It seems possible that different components have different
> offsets (I can imagine this happening with different sizes).
> 
> Now, from inside "test_stripe" there is no information about
> the RAID itself, I mean neither the device nor the metadata type.
> 
> What would be the best way to get the offsets inside "restripe.c"?
> 
> 1) popen "mdadm -E ..." and grep the "Data Offset" field.
> 2) pass one single offset as a command line parameter.
> 3) add the whole md device as a command line parameter, removing
> the single devices, and use sysfs.

As this is really just for testing, I would add a ":offset" suffix to the
device names, where the offset is in sectors.
So:
   /dev/sda
means read from the start of /dev/sda

   /dev/sda:2048

means start 1Meg into /dev/sda.

That should be easy to parse and is quite general.
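
A test invocation could then look like this (hypothetical, assuming
the usual 2048-sector data offset of 1.2 metadata):

   ./test_stripe test file_test.raw 4 65536 6 2 65536 196608 /dev/loop0:2048 /dev/loop1:2048 /dev/loop2:2048 /dev/loop3:2048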

If you wanted to create separate functionality that could be given an active
md device and would read from the component devices, I would link with
sysfs.c and use those routines to read info from sysfs and use that to guide
the code which finds the data.

Make sense?

Thanks,
NeilBrown



> 
> Thanks for any advice,
> 
> bye,
> 
> pg
> 
> 
> diff -uN a/restripe.c b/restripe.c
> --- a/restripe.c	2011-02-09 19:31:18.989495816 +0100
> +++ b/restripe.c	2011-02-09 19:32:42.597955058 +0100
> @@ -430,7 +430,8 @@
>  
>  
>  		if((Px != 0) && (Qx != 0)) {
> -			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF;
> +			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]);
> +			if(data_id < 0) data_id += 255;
>  			diskD = geo_map(data_id, start/chunk_size,
>  					data_disks + 2, level, layout);
>  			curr_broken_disk = diskD;
> @@ -439,6 +440,9 @@
>  		if((Px == 0) && (Qx == 0))
>  			curr_broken_disk = curr_broken_disk;
>  
> +		if(curr_broken_disk >= data_disks + 2)
> +			broken_status = 2;
> +
>  		switch(broken_status) {
>  		case 0:
>  			if(curr_broken_disk != -1) {
> @@ -450,10 +454,6 @@
>  		case 1:
>  			if(curr_broken_disk != prev_broken_disk)
>  				broken_status = 2;
> -
> -			if(curr_broken_disk >= data_disks + 2)
> -				broken_status = 2;
> -
>  			break;
>  
>  		case 2:
>  
> bye,
> 



* Re: [PATCH] Re: User space RAID-6 access
  2011-02-17  6:23             ` NeilBrown
@ 2011-02-17 20:01               ` Piergiorgio Sartor
  2011-02-18 23:02               ` Piergiorgio Sartor
  1 sibling, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-17 20:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

On Thu, Feb 17, 2011 at 05:23:32PM +1100, NeilBrown wrote:
> On Wed, 9 Feb 2011 19:47:34 +0100 Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> 
> > > I have applied the patch - with some formatting changes to make it consistent
> > > with the rest of the code.
> > > 
> > > I don't really have time to look more deeply at it at the moment.
> > > Maybe someone else will?...
> > 
> > Hi Neil,
> > 
> > thanks for including this in git.
> > 
> > Actually I did look at it :-) and I already found a
> > couple of issues; below is a small fix patch.
> 
> Sorry for the delay - I have applied this now.

No problem, I knew it could happen; I was just concerned
about the quality of the patch... :-)
 
> > One question about offsets.
> > I noticed that offsets seem to be defined per component
> > in the md software stack.
> > I mean, looking at sysfs, the offsets are inside md/dev-sdX/offset,
> > or available with mdadm -E /dev/sdX.
> > It seems possible that different components have different
> > offsets (I can imagine this happening with different sizes).
> > 
> > Now, from inside "test_stripe" there is no information about
> > the RAID itself, I mean neither the device nor the metadata type.
> > 
> > What would be the best way to get the offsets inside "restripe.c"?
> > 
> > 1) popen "mdadm -E ..." and grep the "Data Offset" field.
> > 2) pass one single offset as a command line parameter.
> > 3) add the whole md device as a command line parameter, removing
> > the single devices, and use sysfs.
> 
> As this is really just for testing, I would add a ":offset" suffix to the
> device names, where the offset is in sectors.
> So:
>    /dev/sda
> means read from the start of /dev/sda
> 
>    /dev/sda:2048
> 
> means start 1Meg into /dev/sda.
> 
> That should be easy to parse and is quite general.
> 
> If you wanted to create separate functionality that could be given an active
> md device and would read from the component devices, I would link with
> sysfs.c and use those routines to read info from sysfs and use that to guide
> the code which finds the data.
> 
> Make sense?

Thanks for the hint, I guess it makes a lot of sense.

I'll try the ":"; it seems reasonable to me for the time being.
Eventually, I could consider writing a separate piece of code
linking restripe.c and sysfs.c together.

Thanks again,

bye,

pg

> > bye,
> > 
> > pg
> > 
> > 
> > diff -uN a/restripe.c b/restripe.c
> > --- a/restripe.c	2011-02-09 19:31:18.989495816 +0100
> > +++ b/restripe.c	2011-02-09 19:32:42.597955058 +0100
> > @@ -430,7 +430,8 @@
> >  
> >  
> >  		if((Px != 0) && (Qx != 0)) {
> > -			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]) & 0xFF;
> > +			data_id = (raid6_gflog[Qx] - raid6_gflog[Px]);
> > +			if(data_id < 0) data_id += 255;
> >  			diskD = geo_map(data_id, start/chunk_size,
> >  					data_disks + 2, level, layout);
> >  			curr_broken_disk = diskD;
> > @@ -439,6 +440,9 @@
> >  		if((Px == 0) && (Qx == 0))
> >  			curr_broken_disk = curr_broken_disk;
> >  
> > +		if(curr_broken_disk >= data_disks + 2)
> > +			broken_status = 2;
> > +
> >  		switch(broken_status) {
> >  		case 0:
> >  			if(curr_broken_disk != -1) {
> > @@ -450,10 +454,6 @@
> >  		case 1:
> >  			if(curr_broken_disk != prev_broken_disk)
> >  				broken_status = 2;
> > -
> > -			if(curr_broken_disk >= data_disks + 2)
> > -				broken_status = 2;
> > -
> >  			break;
> >  
> >  		case 2:
> >  
> > bye,
> > 
> 

-- 

piergiorgio


* Re: [PATCH] Re: User space RAID-6 access
  2011-02-17  6:23             ` NeilBrown
  2011-02-17 20:01               ` Piergiorgio Sartor
@ 2011-02-18 23:02               ` Piergiorgio Sartor
  1 sibling, 0 replies; 14+ messages in thread
From: Piergiorgio Sartor @ 2011-02-18 23:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Piergiorgio Sartor, linux-raid

> As this is really just for testing, I would add a ":offset" suffix to the
> device names, where the offset is in sectors.
> So:
>    /dev/sda
> means read from the start of /dev/sda
> 
>    /dev/sda:2048
> 
> means start 1Meg into /dev/sda.
> 
> That should be easy to parse and is quite general.

Hi again,

please find attached below another mini-patch
which should add the offset option, as discussed.
That is, "/dev/sdX:offset".
It should work with or without the ":offset" tail,
and possibly also with "/dev/sdX:", since "atoll()"
returns 0 when there are no digits to convert.

Please consider for inclusion.

Thanks,

bye,


diff -uN a/restripe.c b/restripe.c
--- a/restripe.c	2011-02-18 23:18:20.377740868 +0100
+++ b/restripe.c	2011-02-18 23:30:07.589841525 +0100
@@ -875,6 +875,14 @@
 		exit(3);
 	}
 	for (i=0; i<raid_disks; i++) {
+		char *p;
+		p = strchr(argv[9+i], ':');
+
+		if(p != NULL) {
+			*p++ = '\0';
+			offsets[i] = atoll(p) * 512;
+		}
+			
 		fds[i] = open(argv[9+i], O_RDWR);
 		if (fds[i] < 0) {
 			perror(argv[9+i]);

-- 

piergiorgio


Thread overview: 14+ messages
2011-01-31 20:20 User space RAID-6 access Piergiorgio Sartor
2011-01-31 20:52 ` NeilBrown
2011-02-01 19:21   ` Piergiorgio Sartor
2011-02-01 20:14     ` John Robinson
2011-02-01 20:18     ` NeilBrown
2011-02-01 21:00       ` Piergiorgio Sartor
2011-02-05 17:33   ` Piergiorgio Sartor
2011-02-05 20:58     ` NeilBrown
2011-02-07 22:24       ` [PATCH] " Piergiorgio Sartor
2011-02-07 22:49         ` NeilBrown
2011-02-09 18:47           ` Piergiorgio Sartor
2011-02-17  6:23             ` NeilBrown
2011-02-17 20:01               ` Piergiorgio Sartor
2011-02-18 23:02               ` Piergiorgio Sartor
