* Re: Bugreport ddf rebuild problems
  2013-08-01 19:30 Bugreport ddf rebuild problems Albert Pauw
@ 2013-08-01 19:09 ` Martin Wilck
       [not found]   ` <CAGkViCFT0+qm9YAnAJM7JGgVg0RTJi8=HAYDTMs-mfhXinqdcg@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Wilck @ 2013-08-01 19:09 UTC (permalink / raw)
  To: Albert Pauw, linux-raid

Hi Albert,

thanks for reporting this. I can reproduce it.
Out of curiosity, is this a regression?

Martin



* Bugreport ddf rebuild problems
@ 2013-08-01 19:30 Albert Pauw
  2013-08-01 19:09 ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: Albert Pauw @ 2013-08-01 19:30 UTC (permalink / raw)
  To: linux-raid

Now that I've got it working, I ran a simple rebuild test with a ddf
container containing two raid sets (raid 1 and raid 5).

Using 6 loopback devices of 512 MB each, I opened this in a separate screen
to monitor what happens:

while true; do clear; cat /proc/mdstat ; sleep 1; done
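
(In case anyone wants to reproduce this: the loop devices can be set up
roughly like the following; the backing file names are only an example,
not my exact setup.)

for i in 1 2 3 4 5 6; do
    # 512 MB backing file per loop device
    dd if=/dev/zero of=/var/tmp/loop$i.img bs=1M count=512
    losetup /dev/loop$i /var/tmp/loop$i.img
done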

Here are my commands; watch the rebuild.

[root@moonlight ~]# mdadm --zero-superblock /dev/loop[1-6]
[root@moonlight ~]# mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]
mdadm: /dev/loop1 appears to contain an ext2fs file system
        size=153600K  mtime=Thu Jan  1 01:00:00 1970
mdadm: /dev/loop2 appears to contain an ext2fs file system
        size=153600K  mtime=Thu Jan  1 01:00:00 1970
mdadm: container /dev/md127 prepared.
[root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md0
mdadm: You haven't given enough devices (real or missing) to create this array
[root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127
mdadm: array /dev/md0 started.
[root@moonlight ~]# mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127
mdadm: array /dev/md1 started.
[root@moonlight ~]# man mdadm
Formatting page, please wait...
[root@moonlight ~]# mdadm -f /dev/md0 /dev/loop4
mdadm: set /dev/loop4 faulty in /dev/md0
[root@moonlight ~]# mdadm -f /dev/md1 /dev/loop2
mdadm: set /dev/loop2 faulty in /dev/md1
[root@moonlight ~]# mdadm --add /dev/md127 /dev/loop6
mdadm: added /dev/loop6


What should happen is that one of the raid sets gets rebuilt with the added
spare. One raid set does indeed get the spare disk (which is good) and starts
rebuilding; as for the second one, well, have a look:

root@moonlight ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 loop6[3] loop1[2] loop3[0]
       958464 blocks super external:/md127/1 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 loop3[3](S) loop2[2] loop5[0]
       479232 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive loop6[5](S) loop5[4](S) loop4[3](S) loop3[2](S) loop2[1](S) loop1[0](S)
       196608 blocks super external:ddf


Effectively, the raid 1 set (md0) ended up with three disks including a
spare, and it is using a failed disk.

Looks like something is going wrong here.

Albert




* Re: Bugreport ddf rebuild problems
       [not found]   ` <CAGkViCFT0+qm9YAnAJM7JGgVg0RTJi8=HAYDTMs-mfhXinqdcg@mail.gmail.com>
@ 2013-08-01 21:13     ` Martin Wilck
  2013-08-01 22:09       ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Wilck @ 2013-08-01 21:13 UTC (permalink / raw)
  To: Albert Pauw, linux-raid

[-- Attachment #1: Type: text/plain, Size: 58 bytes --]

Can you please try the attached patch?

Regards,
Martin



[-- Attachment #2: 0001-DDF-ddf_process_update-delete-removed-disks-from-dli.patch --]
[-- Type: text/x-patch, Size: 1799 bytes --]

From e6b25d31ee79c1191f936e178324c97d2ce68a57 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@arcor.de>
Date: Fri, 2 Aug 2013 00:11:20 +0200
Subject: [PATCH] DDF: ddf_process_update: delete removed disks from dlist

We currently remove Failed disks that are no longer used by any
VD. If we do that, we must remove them from dlist, too, because
otherwise the same pdnum may occur multiple times in the dlist.
---
 super-ddf.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/super-ddf.c b/super-ddf.c
index 65472a2..f8d5fcd 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -4366,6 +4366,18 @@ static void copy_matching_bvd(struct ddf_super *ddf,
 	       conf->sec_elmnt_seq, guid_str(conf->guid));
 }
 
+static void _delete_dl_by_refnum(struct ddf_super *ddf, be32 refnum)
+{
+	struct dl **pdl = &ddf->dlist;
+	while (*pdl) {
+		if (be32_eq((*pdl)->disk.refnum, refnum)) {
+			free(*pdl);
+			*pdl = (*pdl)->next;
+		} else
+			pdl = &(*pdl)->next;
+	}
+}
+
 static void ddf_process_update(struct supertype *st,
 			       struct metadata_update *update)
 {
@@ -4638,9 +4650,15 @@ static void ddf_process_update(struct supertype *st,
 			if (be16_and(ddf->phys->entries[pdnum].state,
 				     cpu_to_be16(DDF_Failed))
 			    && be16_and(ddf->phys->entries[pdnum].state,
-					cpu_to_be16(DDF_Transition)))
-				/* skip this one */;
-			else if (pdnum == pd2)
+					cpu_to_be16(DDF_Transition))) {
+				/* skip this one and remove from dlist */
+				dprintf("%s: %08x no longer used, removing it\n",
+					__func__,
+					be32_to_cpu(ddf->phys->
+						    entries[pdnum].refnum));
+				_delete_dl_by_refnum(
+					ddf, ddf->phys->entries[pdnum].refnum);
+			} else if (pdnum == pd2)
 				pd2++;
 			else {
 				ddf->phys->entries[pd2] =
-- 
1.7.1



* Re: Bugreport ddf rebuild problems
  2013-08-01 21:13     ` Martin Wilck
@ 2013-08-01 22:09       ` Martin Wilck
  2013-08-01 22:37         ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Wilck @ 2013-08-01 22:09 UTC (permalink / raw)
  To: Albert Pauw, linux-raid

On 08/01/2013 11:13 PM, Martin Wilck wrote:
> Can you please try the attached patch?

DON'T use it, it's broken. I will send a better one in a minute.


> Regards,
> Martin
> 
> 



* Re: Bugreport ddf rebuild problems
  2013-08-01 22:09       ` Martin Wilck
@ 2013-08-01 22:37         ` Martin Wilck
  2013-08-03  9:43           ` Albert Pauw
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Wilck @ 2013-08-01 22:37 UTC (permalink / raw)
  To: Albert Pauw, linux-raid

Hi Albert,

On 08/02/2013 12:09 AM, Martin Wilck wrote:
> On 08/01/2013 11:13 PM, Martin Wilck wrote:
>> Can you please try the attached patch?
> 
> DON'T use it, it's broken. I will send a better one in a minute.

My latest patch series should fix this. The fix for your problem is 3/4,
and I added your test case as 4/4. Retest is welcome.

Martin

> 
> 
>> Regards,
>> Martin
>>
>>
> 



* Re: Bugreport ddf rebuild problems
  2013-08-01 22:37         ` Martin Wilck
@ 2013-08-03  9:43           ` Albert Pauw
  2013-08-04  9:47             ` Albert Pauw
  0 siblings, 1 reply; 22+ messages in thread
From: Albert Pauw @ 2013-08-03  9:43 UTC (permalink / raw)
  To: Martin Wilck; +Cc: linux-raid

Hi Martin,

I can confirm that patch 3/4 indeed fixed the problem.

Thanks for your quick fix.

Kind regards,

Albert

On 08/02/2013 12:37 AM, Martin Wilck wrote:
> Hi Albert,
>
> On 08/02/2013 12:09 AM, Martin Wilck wrote:
>> On 08/01/2013 11:13 PM, Martin Wilck wrote:
>>> Can you please try the attached patch?
>> DON'T use it, it's broken. I will send a better one in a minute.
> My latest patch series should fix this. The fix for your problem is 3/4,
> and I added your test case as 4/4. Retest is welcome.
>
> Martin
>
>>
>>> Regards,
>>> Martin
>>>
>>>





* Re: Bugreport ddf rebuild problems
  2013-08-03  9:43           ` Albert Pauw
@ 2013-08-04  9:47             ` Albert Pauw
  2013-08-05 16:55               ` Albert Pauw
  0 siblings, 1 reply; 22+ messages in thread
From: Albert Pauw @ 2013-08-04  9:47 UTC (permalink / raw)
  To: Martin Wilck, linux-raid

Hi Martin,

I noticed another problem; it occurs with or without your patch:

Zeroed the superblocks (just in case):
mdadm --zero-superblock /dev/loop[1-6]

Created the container:
mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]

Created an md device in the container; it used /dev/loop4 and
/dev/loop5, rebuilt, and finished:
mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127

I failed one of the disks; it rebuilt with one of the available unused
disks in the container:
mdadm -f /dev/md0 /dev/loop4

I added another md device:
mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127

This last one is the odd bit: it is built using the previously failed
disk. It looks like the container is not aware that /dev/loop4 has failed
and reuses it, which is wrong. So the failed status is not kept.
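
To summarize, the whole sequence boils down to this (same devices as
above); the mdadm -E call at the end is just one way to inspect what the
DDF metadata records for the failed disk:

mdadm --zero-superblock /dev/loop[1-6]
mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]
mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127
mdadm -f /dev/md0 /dev/loop4
mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127
cat /proc/mdstat          # md1 comes up using the failed /dev/loop4
mdadm -E /dev/loop4       # check how the DDF metadata marks this disk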

Regards,

Albert



On 08/03/2013 11:43 AM, Albert Pauw wrote:
> Hi Martin,
>
> I can confirm that patch 3/4 indeed fixed the problem.
>
> Thanks for your quick fix.
>
> Kind regards,
>
> Albert
>
> On 08/02/2013 12:37 AM, Martin Wilck wrote:
>> Hi Albert,
>>
>> On 08/02/2013 12:09 AM, Martin Wilck wrote:
>>> On 08/01/2013 11:13 PM, Martin Wilck wrote:
>>>> Can you please try the attached patch?
>>> DON'T use it, it's broken. I will send a better one in a minute.
>> My latest patch series should fix this. The fix for your problem is 3/4,
>> and I added your test case as 4/4. Retest is welcome.
>>
>> Martin
>>
>>>
>>>> Regards,
>>>> Martin
>>>>
>>>>
>





* Re: Bugreport ddf rebuild problems
  2013-08-04  9:47             ` Albert Pauw
@ 2013-08-05 16:55               ` Albert Pauw
  2013-08-05 21:24                 ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: Albert Pauw @ 2013-08-05 16:55 UTC (permalink / raw)
  To: Martin Wilck, linux-raid, Neil Brown

Hi Neil/Martin,

I have pulled the latest git version (up to and including "Makefile: check
that 'run' directory exists").
It works, except for the last error I noticed; see below.

Regards,

Albert


On 08/04/2013 11:47 AM, Albert Pauw wrote:
> Hi Martin,
>
> I noticed another problem, with or without your patch, the problem 
> occurs:
>
> Zeroed the superblocks (just in case):
> mdadm --zero-superblock /dev/loop[1-6]
>
> Created the container:
> mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]
>
> Created an md device in the container, it used /dev/loop4 and 
> /dev/loop5, rebuild and finished
> mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127
>
> I fail one of the disks, it rebuild with one of the available unused 
> disks in the container:
> mdadm -f /dev/md0 /dev/loop4
>
> I add another md device.
> mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127
>
> This last one is the odd bit, it is build using the previously failed 
> disk. Looks like the container is not aware that /dev/loop4 has failed,
> en reuses it. Which is wrong. So the failed status is not kept.
>
> Regards,
>
> Albert
>





* Re: Bugreport ddf rebuild problems
  2013-08-05 16:55               ` Albert Pauw
@ 2013-08-05 21:24                 ` Martin Wilck
  2013-08-06  0:16                   ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Wilck @ 2013-08-05 21:24 UTC (permalink / raw)
  To: Albert Pauw, Neil Brown; +Cc: linux-raid

Hi Albert, Neil,

I just submitted a new patch series; patch 3/5 integrates your 2nd case
as a new unit test and 4/5 should fix it.

However @Neil: I am not yet entirely happy with this solution. AFAICS
there is a possible race condition here, if a disk fails and mdadm -CR
is called to create a new array before the metadata reflecting the
failure is written to disk. If a disk failure happens in one array,
mdmon will call reconcile_failed() to propagate the failure to other
already known arrays in the same container, by writing "faulty" to the
sysfs state attribute. It can't do that for a new container though.
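
(For illustration: for an already-running member array, that propagation
essentially boils down to a sysfs write like the following; the path is
only an example.)

# what reconcile_failed() effectively does for an existing member array
echo faulty > /sys/block/md0/md/dev-loop4/state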

I thought that process_update() may need to check the kernel state of
array members against meta data state when a new VD configuration record
is received, but that's impossible because we can't call open() on the
respective sysfs files. It could be done in prepare_update(), but that
would require major changes, I wanted to ask you first.

Another option would be changing manage_new(). But we don't seem to have
a suitable metadata handler method to pass the meta data state to the
manager....

Ideas?

Regards
Martin



* Re: Bugreport ddf rebuild problems
  2013-08-05 21:24                 ` Martin Wilck
@ 2013-08-06  0:16                   ` NeilBrown
  2013-08-06 21:26                     ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2013-08-06  0:16 UTC (permalink / raw)
  To: Martin Wilck; +Cc: Albert Pauw, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> Hi Albert, Neil,
> 
> I just submitted a new patch series; patch 3/5 integrates your 2nd case
> as a new unit test and 4/5 should fix it.
> 
> However @Neil: I am not yet entirely happy with this solution. AFAICS
> there is a possible race condition here, if a disk fails and mdadm -CR
> is called to create a new array before the metadata reflecting the
> failure is written to disk. If a disk failure happens in one array,
> mdmon will call reconcile_failed() to propagate the failure to other
> already known arrays in the same container, by writing "faulty" to the
> sysfs state attribute. It can't do that for a new container though.
> 
> I thought that process_update() may need to check the kernel state of
> array members against meta data state when a new VD configuration record
> is received, but that's impossible because we can't call open() on the
> respective sysfs files. It could be done in prepare_update(), but that
> would require major changes, I wanted to ask you first.
> 
> Another option would be changing manage_new(). But we don't seem to have
> a suitable metadata handler method to pass the meta data state to the
> manager....
> 
> Ideas?

Thanks for the patches - I applied them all.

Is there a race here?  When "mdadm -C" looks at the metadata the device will
either be an active member of another array, or it will be marked faulty.
Either way mdadm won't use it.

If the first array was created to use only (say) half of each device and the
second array was created with a size to fit in the other half of the device
then it might get interesting.
"mdadm -C" might see that everything looks good, create the array using the
second half of that drive that has just failed, and give that info to mdmon.
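
Roughly this kind of sequence - sizes and device names picked purely for
illustration:

mdadm -CR /dev/md127 -e ddf -l container -n 2 /dev/loop1 /dev/loop2
mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127 -z 10000   # uses only part of each disk
mdadm --fail /dev/md0 /dev/loop1 &                     # disk fails ...
mdadm -CR /dev/md1 -l raid1 -n 2 /dev/md127 -z 10000   # ... while a second array is created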

I suspect that ddf_open_new (which currently looks like it is just a stub)
needs to help out here.
When manage_new() gets told about a new array it will collect relevant info
from sysfs and call ->open_new() to make sure it matches the metadata.
ddf_open_new should check that all the devices in the array are recorded as
working in the metadata.  If any are failed, it can write 'faulty' to the
relevant state_fd.

Possibly the same thing can be done generically in manage_new() as you
suggested.  After the new array has been passed over to the monitor thread,
manage_new() could check if any devices should be failed much like
reconcile_failed() does and just fail them.

Does that make any sense?  Did I miss something?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: Bugreport ddf rebuild problems
  2013-08-06  0:16                   ` NeilBrown
@ 2013-08-06 21:26                     ` Martin Wilck
  2013-08-06 21:37                       ` Patches related to current discussion mwilck
                                         ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Martin Wilck @ 2013-08-06 21:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: Albert Pauw, linux-raid

On 08/06/2013 02:16 AM, NeilBrown wrote:
> On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> 
>> Hi Albert, Neil,
>>
>> I just submitted a new patch series; patch 3/5 integrates your 2nd case
>> as a new unit test and 4/5 should fix it.
>>
>> However @Neil: I am not yet entirely happy with this solution. AFAICS
>> there is a possible race condition here, if a disk fails and mdadm -CR
>> is called to create a new array before the metadata reflecting the
>> failure is written to disk. If a disk failure happens in one array,
>> mdmon will call reconcile_failed() to propagate the failure to other
>> already known arrays in the same container, by writing "faulty" to the
>> sysfs state attribute. It can't do that for a new container though.
>>
>> I thought that process_update() may need to check the kernel state of
>> array members against meta data state when a new VD configuration record
>> is received, but that's impossible because we can't call open() on the
>> respective sysfs files. It could be done in prepare_update(), but that
>> would require major changes, I wanted to ask you first.
>>
>> Another option would be changing manage_new(). But we don't seem to have
>> a suitable metadata handler method to pass the meta data state to the
>> manager....
>>
>> Ideas?
> 
> Thanks for the patches - I applied them all.

I don't see them in the public repo yet.

> Is there a race here?  When "mdadm -C" looks at the metadata the device will
> either be an active member of another array, or it will be marked faulty.
> Either way mdadm won't use it.

That's right, thanks.

> If the first array was created to use only (say) half of each device and the
> second array was created with a size to fit in the other half of the device
> then it might get interesting.
> "mdadm -C" might see that everything looks good, create the array using the
> second half of that drive that has just failed, and give that info to mdmon.

Yes, I have created a test case for this (10ddf-fail-create-race) which
I am going to submit soon.

> I suspect that ddf_open_new (which currently looks like it is just a stub)
> needs to help out here.

Great idea; I made an implementation. I found that I needed to freeze
the array in Create(), too, to avoid the kernel starting a rebuild
before mdmon has checked the correctness of the new array. Please review
that; I'm not 100% positive I got it right.

> When manage_new() gets told about a new array it will collect relevant info
> from sysfs and call ->open_new() to make sure it matches the metadata.
> ddf_open_new should check that all the devices in the array are recorded as
> working in the metadata.  If any are failed, it can write 'faulty' to the
> relevant state_fd.
> 
> Possibly the same thing can be done generically in manage_new() as you
> suggested.  After the new array has been passed over to the monitor thread,
> manage_new() could check if any devices should be failed much like
> reconcile_failed() does and just fail them.
> 
> Does that make any sense?  Did I miss something?

It makes a lot of sense.

While testing, I found another minor problem case:

 1. a disk fails in an array taking up only half of each device
 2. mdmon activates a spare
 3. mdadm -C is called, finds the old meta data, and allocates an extent
    at offset 0 on the spare
 4. Create() gets an error writing to the "size" sysfs attribute, because
    offset 0 has already been grabbed by the spare recovery

That's not too bad, after all, because the array won't be created. The
user just needs to re-issue his mdadm -C command which will now succeed
because the meta data should have been written to disk in the meantime.

That said, some kind of locking between mdadm and mdmon (mdadm won't
read meta data as long as mdmon is busy writing them) might be
desirable. It would be even better to do all meta data operations
through mdmon, mdadm just sending messages to it. That would be a major
architectural change for mdadm, but it would avoid this kind of
"different meta data here and there" problem altogether.

Thanks
Martin

> 
> Thanks,
> NeilBrown



* Patches related to current discussion
  2013-08-06 21:26                     ` Martin Wilck
@ 2013-08-06 21:37                       ` mwilck
  2013-08-06 21:38                       ` [PATCH 6/9] tests/10ddf-fail-spare: more sophisticated result checks mwilck
                                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: mwilck @ 2013-08-06 21:37 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: mwilck

Hi Neil,

This series contains some fine tuning of Albert's spare activation
test, the new race condition test announced earlier, and two proposed
fixes to avoid it.

Starting at 6 to avoid confusion with the previous series.

Note that without reverting 273989b9 and ce45c819, the ddf tests will
fail anyway. I have a corresponding patch in my tree, but I'll leave it
to you to decide how to solve that.

Martin


Martin Wilck (4):
  tests/10ddf-fail-spare: more sophisticated result checks
  tests/10ddf-fail-create-race: test handling of fail/create race
  DDF: ddf_open_new: check device status for new subarray
  Create: set array status to frozen until monitoring starts

 Create.c                     |    8 ++++
 managemon.c                  |    6 +++
 super-ddf.c                  |   27 +++++++++++++-
 tests/10ddf-fail-create-race |   66 +++++++++++++++++++++++++++++++++
 tests/10ddf-fail-spare       |   83 +++++++++++++++++++++++++++++++++---------
 5 files changed, 172 insertions(+), 18 deletions(-)
 create mode 100644 tests/10ddf-fail-create-race


* [PATCH 6/9] tests/10ddf-fail-spare: more sophisticated result checks
  2013-08-06 21:26                     ` Martin Wilck
  2013-08-06 21:37                       ` Patches related to current discussion mwilck
@ 2013-08-06 21:38                       ` mwilck
  2013-08-06 21:38                       ` [PATCH 7/9] tests/10ddf-fail-create-race: test handling of fail/create race mwilck
                                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: mwilck @ 2013-08-06 21:38 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: mwilck

This test can succeed two ways, depending on timing.

Signed-off-by: Martin Wilck <mwilck@arcor.de>
---
 tests/10ddf-fail-spare |   83 ++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/tests/10ddf-fail-spare b/tests/10ddf-fail-spare
index 60e8110..262fda4 100644
--- a/tests/10ddf-fail-spare
+++ b/tests/10ddf-fail-spare
@@ -1,37 +1,86 @@
-# sanity check array creation
+# Test suggested by Albert Pauw: Create, fail one disk, have mdmon
+# activate the spare,
+# then run create again. Shouldn't use the failed disk for Create,
 . tests/env-ddf-template
 
+tmp=$(mktemp /tmp/mdtest-XXXXXX)
+rm -f $tmp
+
 mdadm --zero-superblock $dev8 $dev9 $dev10 $dev11 $dev12 $dev13
 mdadm -CR $container -e ddf -l container -n 5 $dev8 $dev9 $dev10 $dev11 $dev12
 
-mdadm -CR $member0 -l raid1 -n 2 $container >/tmp/mdmon.txt 2>&1
+mdadm -CR $member0 -l raid1 -n 2 $container
 #$dir/mdadm -CR $member0 -l raid1 -n 2 $container >/tmp/mdmon.txt 2>&1
 
 check wait
 
 set -- $(get_raiddisks $member0)
 fail0=$1
-mdadm -f $member0 $fail0
+mdadm --fail $member0 $fail0
 
-# need to sleep shortly here, to give monitor some time to active the spare
-sleep 0.5
+# To make sure the spare is activated, we may have to sleep
+# 2s has always been enough for me
+sleep 2
 check wait
 
+# This test can succeed both ways - if spare was activated
+# before new array was created, we see only member 0.
+# otherwise, we see both, and member0 is degraded because the
+# new array grabbed the spare
+# which case occurs depends on the sleep time above.
+ret=0
 if mdadm -CR $member1 -l raid5 -n 3 $container; then
-   echo error: create should have failed
-   set -- $(get_raiddisks $member0)
-   d0=$1
+   # Creation successful - must have been quicker than spare activation
+    
+   check wait
    set -- $(get_raiddisks $member1)
+   if [ $1 = $fail0 -o $2 = $fail0 -o $3 = $fail0 ]; then
+       echo ERROR: $member1 must not contain $fail0: $@
+       ret=1
+   fi
    d1=$1
-   cat /proc/mdstat
-   mdadm -E $d0
-   mdadm -E $d1
-   mdadm -E $fail0
-   rv=1
-#   cat /tmp/mdmon.txt
+   mdadm -E $d1 >$tmp
+   if ! grep -q 'state\[1\] : Optimal, Consistent' $tmp; then
+       echo ERROR: member 1 should be optimal in meta data
+       ret=1
+   fi
+   state0=Degraded
 else
-   rv=0
+   # Creation unsuccessful - spare was used for member 0
+   state0=Optimal
+fi
+
+# need to delay a little bit, sometimes the meta data aren't
+# up-to-date yet
+sleep 0.5
+set -- $(get_raiddisks $member0)
+if [ $1 = $fail0 -o $2 = $fail0 ]; then
+    echo ERROR: $member0 must not contain $fail0: $@
+    ret=1
 fi
+d0=$1
+
+[ -f $tmp ] || mdadm -E $d0 >$tmp
+
+if ! grep -q 'state\[0\] : '$state0', Consistent' $tmp; then
+    echo ERROR: member 0 should be $state0 in meta data
+    ret=1
+fi
+if ! grep -q 'Offline, Failed' $tmp; then
+    echo ERROR: Failed disk expected in meta data
+    ret=1
+fi
+if [ $ret -eq 1 ]; then
+    cat /proc/mdstat
+    mdadm -E $d0
+    mdadm -E $d1
+    mdadm -E $fail0
+fi
+
+[ -f /tmp/mdmon.txt ] && {
+    cat /tmp/mdmon.txt
+    rm -f /tmp/mdmon.txt
+}
 
-#  rm -f /tmp/mdmon.txt
-exit $rv
+rm -f $tmp
+[ $ret -eq 0 ]
-- 
1.7.1


* [PATCH 7/9] tests/10ddf-fail-create-race: test handling of fail/create race
  2013-08-06 21:26                     ` Martin Wilck
  2013-08-06 21:37                       ` Patches related to current discussion mwilck
  2013-08-06 21:38                       ` [PATCH 6/9] tests/10ddf-fail-spare: more sophisticated result checks mwilck
@ 2013-08-06 21:38                       ` mwilck
  2013-08-06 21:38                       ` [PATCH 8/9] DDF: ddf_open_new: check device status for new subarray mwilck
                                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: mwilck @ 2013-08-06 21:38 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: mwilck

If a disk fails and simultaneously a new array is created, a race
condition may arise because the meta data on disk doesn't reflect
the disk failure yet. This is a test for that case.

Signed-off-by: Martin Wilck <mwilck@arcor.de>
---
 tests/10ddf-fail-create-race |   66 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 66 insertions(+), 0 deletions(-)
 create mode 100644 tests/10ddf-fail-create-race

diff --git a/tests/10ddf-fail-create-race b/tests/10ddf-fail-create-race
new file mode 100644
index 0000000..505636c
--- /dev/null
+++ b/tests/10ddf-fail-create-race
@@ -0,0 +1,66 @@
+# This test creates a RAID1, fails a disk, and immediately
+# (simultaneously) creates a new array. This tests for a possible
+# race where the meta data reflecting the disk failure may not
+# be written when the 2nd array is created.
+. tests/env-ddf-template
+
+mdadm --zero-superblock $dev8 $dev9 $dev10 $dev11 $dev12 $dev13
+
+mdadm -CR $container -e ddf -l container -n 2 $dev11 $dev12
+#$dir/mdadm -CR $member0 -l raid1 -n 2 $container -z 10000  >/tmp/mdmon.txt 2>&1
+mdadm -CR $member0 -l raid1 -n 2 $container -z 10000
+check wait
+fail0=$dev11
+mdadm --fail $member0 $fail0 &
+
+# The test can succeed two ways: 
+# 1) mdadm -C member1 fails - in this case the meta data
+# was already on disk when the create attempt was made
+# 2) mdadm -C succeeds in the first place (meta data not on disk yet),
+# but mdmon detects the problem and sets the disk faulty.
+
+if mdadm -CR $member1 -l raid1 -n 2 $container; then
+
+   echo create should have failed / race condition?
+   
+   check wait
+   set -- $(get_raiddisks $member0)
+   d0=$1
+   ret=0
+   if [ $1 = $fail0 -o $2 = $fail0 ]; then
+       ret=1
+   else 
+       set -- $(get_raiddisks $member1)
+       if [ $1 = $fail0 -o $2 = $fail0 ]; then
+	   ret=1
+       fi
+   fi
+   if [ $ret -eq 1 ]; then
+       echo ERROR: failed disk $fail0 is still a RAID member
+       echo $member0: $(get_raiddisks $member0)
+       echo $member1: $(get_raiddisks $member1)
+   fi
+   tmp=$(mktemp /tmp/mdest-XXXXXX)
+   mdadm -E $d0 >$tmp
+   if [ x$(grep -c 'state\[[01]\] : Degraded' $tmp) != x2 ]; then
+       echo ERROR: non-degraded array found
+       mdadm -E $d0
+       ret=1
+   fi
+   if ! grep -q '^  *0  *[0-9a-f]\{8\} .*Offline, Failed' $tmp; then
+       echo ERROR: disk 0 not marked as failed in meta data
+       mdadm -E $d0
+       ret=1
+   fi
+   rm -f $tmp
+else
+   ret=0
+fi
+
+[ -f /tmp/mdmon.txt ] && {
+    cat /tmp/mdmon.txt
+    rm -f /tmp/mdmon.txt
+}
+
+[ $ret -eq 0 ]
+
-- 
1.7.1


* [PATCH 8/9] DDF: ddf_open_new: check device status for new subarray
  2013-08-06 21:26                     ` Martin Wilck
                                         ` (2 preceding siblings ...)
  2013-08-06 21:38                       ` [PATCH 7/9] tests/10ddf-fail-create-race: test handling of fail/create race mwilck
@ 2013-08-06 21:38                       ` mwilck
  2013-08-06 21:38                       ` [PATCH 9/9] Create: set array status to frozen until monitoring starts mwilck
                                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: mwilck @ 2013-08-06 21:38 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: mwilck

It is possible that mdadm creates a new subarray containing failed
devices. This may happen if a device has failed, but the meta data
containing that information hasn't been written out yet.

This code tests for this situation, and handles it in the monitor.

Signed-off-by: Martin Wilck <mwilck@arcor.de>
---
 super-ddf.c |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/super-ddf.c b/super-ddf.c
index 23c5439..7f393ff 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -3996,11 +3996,36 @@ static int ddf_open_new(struct supertype *c, struct active_array *a, char *inst)
 {
 	struct ddf_super *ddf = c->sb;
 	int n = atoi(inst);
+	struct mdinfo *dev;
+	struct dl *dl;
+	static const char faulty[] = "faulty";
+
 	if (all_ff(ddf->virt->entries[n].guid)) {
 		pr_err("%s: subarray %d doesn't exist\n", __func__, n);
 		return -ENODEV;
 	}
-	dprintf("ddf: open_new %d\n", n);
+	dprintf("%s: new subarray %d, GUID: %s\n", __func__, n,
+		guid_str(ddf->virt->entries[n].guid));
+	for (dev = a->info.devs; dev; dev = dev->next) {
+		for (dl = ddf->dlist; dl; dl = dl->next)
+			if (dl->major == dev->disk.major &&
+			    dl->minor == dev->disk.minor)
+				break;
+		if (!dl) {
+			pr_err("%s: device %d/%d of subarray %d not found in meta data\n",
+				__func__, dev->disk.major, dev->disk.minor, n);
+			return -1;
+		}
+		if ((be16_to_cpu(ddf->phys->entries[dl->pdnum].state) &
+			(DDF_Online|DDF_Missing|DDF_Failed)) != DDF_Online) {
+			pr_err("%s: new subarray %d contains broken device %d/%d (%02x)\n",
+				__func__, n, dl->major, dl->minor,
+				be16_to_cpu(
+					ddf->phys->entries[dl->pdnum].state));
+			write(dev->state_fd, faulty, sizeof(faulty)-1);
+			dev->curr_state = DS_FAULTY;
+		}
+	}
 	a->info.container_member = n;
 	return 0;
 }
-- 
1.7.1


* [PATCH 9/9] Create: set array status to frozen until monitoring starts
  2013-08-06 21:26                     ` Martin Wilck
                                         ` (3 preceding siblings ...)
  2013-08-06 21:38                       ` [PATCH 8/9] DDF: ddf_open_new: check device status for new subarray mwilck
@ 2013-08-06 21:38                       ` mwilck
  2013-08-08  0:44                         ` NeilBrown
  2013-08-07 18:07                       ` Bugreport ddf rebuild problems Albert Pauw
  2013-08-08  0:40                       ` NeilBrown
  6 siblings, 1 reply; 22+ messages in thread
From: mwilck @ 2013-08-06 21:38 UTC (permalink / raw)
  To: neilb, linux-raid; +Cc: mwilck

When we create an array while mdmon is working on an event
(e.g. disk failure), the meta data on disk may not be up-to-date.

Patch "DDF: ddf_open_new: check device status for new subarray"
added some checks for in the monitor for that situation - in particular,
to handle a freshly created array with faulty disks. The remaining
problem is that the kernel may start syncing the disks before this
situation is detected. This patch delays recovery until mdmon finished
checking.

tests/10ddf-fail-create-race should succeed reliably with this patch
and "DDF: ddf_open_new: check device status for new subarray". Without,
it will fail sporadically.

Signed-off-by: Martin Wilck <mwilck@arcor.de>
---
 Create.c    |    8 ++++++++
 managemon.c |    6 ++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/Create.c b/Create.c
index ac22f77..f9b7db2 100644
--- a/Create.c
+++ b/Create.c
@@ -993,6 +993,14 @@ int Create(struct supertype *st, char *mddev,
 				need_mdmon = 0;
 				break;
 			default:
+				/*
+				 * The meta data we saw on disk may not be
+				 * up-to-date. The monitor will check and
+				 * possibly fail. Avoid a resync happening
+				 * in the kernel before that.
+				 */
+				sysfs_set_str(&info, NULL, "sync_action",
+					      "frozen");
 				err = sysfs_set_str(&info, NULL, "array_state",
 						    "readonly");
 				break;
diff --git a/managemon.c b/managemon.c
index f40bbdb..5bc54da 100644
--- a/managemon.c
+++ b/managemon.c
@@ -744,6 +744,12 @@ static void manage_new(struct mdstat_ent *mdstat,
 		new->container = NULL;
 		free_aa(new);
 	} else {
+		/*
+		 * Create() set this to frozen.
+		 * This relies on the kernel clear FROZEN status
+		 * if an invalid value is written to sync_action.
+		 */
+		sysfs_set_str(&new->info, NULL, "sync_action", "");
 		replace_array(container, victim, new);
 		if (failed) {
 			new->check_degraded = 1;
-- 
1.7.1


* Re: Bugreport ddf rebuild problems
  2013-08-06 21:26                     ` Martin Wilck
                                         ` (4 preceding siblings ...)
  2013-08-06 21:38                       ` [PATCH 9/9] Create: set array status to frozen until monitoring starts mwilck
@ 2013-08-07 18:07                       ` Albert Pauw
  2013-08-08  0:40                       ` NeilBrown
  6 siblings, 0 replies; 22+ messages in thread
From: Albert Pauw @ 2013-08-07 18:07 UTC (permalink / raw)
  To: Martin Wilck, Neil Brown, linux-raid

Just to let you guys know that I pulled the latest git version
(including "DDF: Write new conf entries with a single write").

As of now, it all seems to work rather nicely.

Thanks guys,

Albert


On 08/06/2013 11:26 PM, Martin Wilck wrote:
> On 08/06/2013 02:16 AM, NeilBrown wrote:
>> On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:
>>
>>> Hi Albert, Neil,
>>>
>>> I just submitted a new patch series; patch 3/5 integrates your 2nd case
>>> as a new unit test and 4/5 should fix it.
>>>
>>> However @Neil: I am not yet entirely happy with this solution. AFAICS
>>> there is a possible race condition here, if a disk fails and mdadm -CR
>>> is called to create a new array before the metadata reflecting the
>>> failure is written to disk. If a disk failure happens in one array,
>>> mdmon will call reconcile_failed() to propagate the failure to other
>>> already known arrays in the same container, by writing "faulty" to the
>>> sysfs state attribute. It can't do that for a new container though.
>>>
>>> I thought that process_update() may need to check the kernel state of
>>> array members against meta data state when a new VD configuration record
>>> is received, but that's impossible because we can't call open() on the
>>> respective sysfs files. It could be done in prepare_update(), but that
>>> would require major changes, I wanted to ask you first.
>>>
>>> Another option would be changing manage_new(). But we don't seem to have
>>> a suitable metadata handler method to pass the meta data state to the
>>> manager....
>>>
>>> Ideas?
>> Thanks for the patches - I applied them all.
> I don't see them in the public repo yet.
>
>> Is there a race here?  When "mdadm -C" looks at the metadata the device will
>> either be an active member of another array, or it will be marked faulty.
>> Either way mdadm won't use it.
> That's right, thanks.
>
>> If the first array was created to use only (say) half of each device and the
>> second array was created with a size to fit in the other half of the device
>> then it might get interesting.
>> "mdadm -C" might see that everything looks good, create the array using the
>> second half of that drive that has just failed, and give that info to mdmon.
> Yes, I have created a test case for this (10ddf-fail-create-race) which
> I am going to submit soon.
>
>> I suspect that ddf_open_new (which currently looks like it is just a stub)
>> needs to help out here.
> Great idea, I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before the mdmon checked the correctness of the new array. Please review
> that, I'm not 100% positive I got it right.
>
>> When manage_new() gets told about a new array it will collect relevant info
>> from sysfs and call ->open_new() to make sure it matches the metadata.
>> ddf_open_new should check that all the devices in the array are recorded as
>> working in the metadata.  If any are failed, it can write 'faulty' to the
>> relevant state_fd.
>>
>> Possibly the same thing can be done generically in manage_new() as you
>> suggested.  After the new array has been passed over to the monitor thread,
>> manage_new() could check if any devices should be failed much like
>> reconcile_failed() does and just fail them.
>>
>> Does that make any sense?  Did I miss something?
> It makes a lot of sense.
>
> While testing, I found another minor problem case:
>
>   1 disk fails in array taking half size
>   2 mdmon activates spare
>   3 mdadm -C is called and finds old meta data, allocates extent at
> offset 0 on the spare
>   4 Create() gets an error writing to the "size" sysfs attribute because
> offset 0 has been grabbed by the spare recovery already
>
> That's not too bad, after all, because the array won't be created. The
> user just needs to re-issue his mdadm -C command which will now succeed
> because the meta data should have been written to disk in the meantime.
>
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read meta data as long as mdmon is busy writing them) might be
> desirable. It would be even better to do all meta data operations
> through mdmon, mdadm just sending messages to it. That would be a major
> architectural change for mdadm, but it would avoid this kind of
> "different meta data here and there" problem altogether.
>
> Thanks
> Martin
>
>> Thanks,
>> NeilBrown





* Re: Bugreport ddf rebuild problems
  2013-08-06 21:26                     ` Martin Wilck
                                         ` (5 preceding siblings ...)
  2013-08-07 18:07                       ` Bugreport ddf rebuild problems Albert Pauw
@ 2013-08-08  0:40                       ` NeilBrown
  6 siblings, 0 replies; 22+ messages in thread
From: NeilBrown @ 2013-08-08  0:40 UTC (permalink / raw)
  To: Martin Wilck; +Cc: Albert Pauw, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5019 bytes --]

On Tue, 06 Aug 2013 23:26:30 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> On 08/06/2013 02:16 AM, NeilBrown wrote:
> > On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> > 
> >> Hi Albert, Neil,
> >>
> >> I just submitted a new patch series; patch 3/5 integrates your 2nd case
> >> as a new unit test and 4/5 should fix it.
> >>
> >> However @Neil: I am not yet entirely happy with this solution. AFAICS
> >> there is a possible race condition here, if a disk fails and mdadm -CR
> >> is called to create a new array before the metadata reflecting the
> >> failure is written to disk. If a disk failure happens in one array,
> >> mdmon will call reconcile_failed() to propagate the failure to other
> >> already known arrays in the same container, by writing "faulty" to the
> >> sysfs state attribute. It can't do that for a new container though.
> >>
> >> I thought that process_update() may need to check the kernel state of
> >> array members against meta data state when a new VD configuration record
> >> is received, but that's impossible because we can't call open() on the
> >> respective sysfs files. It could be done in prepare_update(), but that
> >> would require major changes, I wanted to ask you first.
> >>
> >> Another option would be changing manage_new(). But we don't seem to have
> >> a suitable metadata handler method to pass the meta data state to the
> >> manager....
> >>
> >> Ideas?
> > 
> > Thanks for the patches - I applied them all.
> 
> I don't see them in the public repo yet.
> 
> > Is there a race here?  When "mdadm -C" looks at the metadata the device will
> > either be an active member of another array, or it will be marked faulty.
> > Either way mdadm won't use it.
> 
> That's right, thanks.
> 
> > If the first array was created to use only (say) half of each device and the
> > second array was created with a size to fit in the other half of the device
> > then it might get interesting.
> > "mdadm -C" might see that everything looks good, create the array using the
> > second half of that drive that has just failed, and give that info to mdmon.
> 
> Yes, I have created a test case for this (10ddf-fail-create-race) which
> I am going to submit soon.
> 
> > I suspect that ddf_open_new (which currently looks like it is just a stub)
> > needs to help out here.
> 
> Great idea, I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before the mdmon checked the correctness of the new array. Please review
> that, I'm not 100% positive I got it right.
> 
> > When manage_new() gets told about a new array it will collect relevant info
> > from sysfs and call ->open_new() to make sure it matches the metadata.
> > ddf_open_new should check that all the devices in the array are recorded as
> > working in the metadata.  If any are failed, it can write 'faulty' to the
> > relevant state_fd.
> > 
> > Possibly the same thing can be done generically in manage_new() as you
> > suggested.  After the new array has been passed over to the monitor thread,
> > manage_new() could check if any devices should be failed much like
> > reconcile_failed() does and just fail them.
> > 
> > Does that make any sense?  Did I miss something?
> 
> It makes a lot of sense.
> 
> While testing, I found another minor problem case:
> 
>  1 disk fails in array taking half size
>  2 mdmon activates spare
>  3 mdadm -C is called and finds old meta data, allocates extent at
> offset 0 on the spare
>  4 Create() gets an error writing to the "size" sysfs attribute because
> offset 0 has been grabbed by the spare recovery already
> 
> That's not too bad, after all, because the array won't be created. The
> user just needs to re-issue his mdadm -C command which will now succeed
> because the meta data should have been written to disk in the meantime.
> 
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read meta data as long as mdmon is busy writing them) might be
> desirable. It would be even better to do all meta data operations
> through mdmon, mdadm just sending messages to it. That would be a major
> architectural change for mdadm, but it would avoid this kind of
> "different meta data here and there" problem altogether.

I think the way this locking "should" happen is with an O_EXCL lock on the
container.
mdadm and mdmon already do this to some extent, but it is hard to find :-)

The most obvious usage is that mdmon insists on getting an O_EXCL open before
it will exit.  --create also gets an O_EXCL open, so mdmon cannot
exit while a new array might be being created.
See "When creating a member, we need to be careful" in Create.c and
" No interesting arrays, or we have been told to" in monitor.c

So this race could be closed by manager getting an O_EXCL open on the
container before performing spare management.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [PATCH 9/9] Create: set array status to frozen until monitoring starts
  2013-08-06 21:38                       ` [PATCH 9/9] Create: set array status to frozen until monitoring starts mwilck
@ 2013-08-08  0:44                         ` NeilBrown
  2013-08-08  7:31                           ` Martin Wilck
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2013-08-08  0:44 UTC (permalink / raw)
  To: mwilck; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2796 bytes --]

On Tue,  6 Aug 2013 23:38:03 +0200 mwilck@arcor.de wrote:

> When we create an array while mdmon is working on an event
> (e.g. disk failure), the meta data on disk may not be up-to-date.
> 
> Patch "DDF: ddf_open_new: check device status for new subarray"
> added some checks for in the monitor for that situation - in particular,
> to handle a freshly created array with faulty disks. The remaining
> problem is that the kernel may start syncing the disks before this
> situation is detected. This patch delays recovery until mdmon finished
> checking.
> 
> tests/10ddf-fail-create-race should succeed reliably with this patch
> and "DDF: ddf_open_new: check device status for new subarray". Without,
> it will fail sporadically.
> 
> Signed-off-by: Martin Wilck <mwilck@arcor.de>
> ---
>  Create.c    |    8 ++++++++
>  managemon.c |    6 ++++++
>  2 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index ac22f77..f9b7db2 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -993,6 +993,14 @@ int Create(struct supertype *st, char *mddev,
>  				need_mdmon = 0;
>  				break;
>  			default:
> +				/*
> +				 * The meta data we saw on disk may not be
> +				 * up-to-date. The monitor will check and
> +				 * possibly fail. Avoid a resync happening
> +				 * in the kernel before that.
> +				 */
> +				sysfs_set_str(&info, NULL, "sync_action",
> +					      "frozen");
>  				err = sysfs_set_str(&info, NULL, "array_state",
>  						    "readonly");
>  				break;
> diff --git a/managemon.c b/managemon.c
> index f40bbdb..5bc54da 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -744,6 +744,12 @@ static void manage_new(struct mdstat_ent *mdstat,
>  		new->container = NULL;
>  		free_aa(new);
>  	} else {
> +		/*
> +		 * Create() set this to frozen.
> +		 * This relies on the kernel clear FROZEN status
> +		 * if an invalid value is written to sync_action.
> +		 */
> +		sysfs_set_str(&new->info, NULL, "sync_action", "");
>  		replace_array(container, victim, new);
>  		if (failed) {
>  			new->check_degraded = 1;


I don't think that this patch should be necessary.
If you find it makes a difference, then I'm missing something.

Recovery never starts while the array is readonly - the kernel doesn't allow
it.
So mdmon should be in complete control as it is the only one that is allowed
to change the array from readonly.

Whether it quite does the right thing in this case I can't promise, but it
should be able to do the right thing without any extra help from mdadm.

And writing an empty string to 'sync_action' will not do anything useful.
You probably mean to write "idle".
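
For reference, the sync_action semantics in question (path shown only as
an example):

echo frozen > /sys/block/md0/md/sync_action   # sets MD_RECOVERY_FROZEN, blocks recovery
echo idle   > /sys/block/md0/md/sync_action   # clears FROZEN and lets recovery proceed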

I've applied and pushed out the previous patches but not this one.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [PATCH 9/9] Create: set array status to frozen until monitoring starts
  2013-08-08  0:44                         ` NeilBrown
@ 2013-08-08  7:31                           ` Martin Wilck
  0 siblings, 0 replies; 22+ messages in thread
From: Martin Wilck @ 2013-08-08  7:31 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 08/08/2013 02:44 AM, NeilBrown wrote:
> On Tue,  6 Aug 2013 23:38:03 +0200 mwilck@arcor.de wrote:
> 
>> When we create an array while mdmon is working on an event
>> (e.g. disk failure), the meta data on disk may not be up-to-date.
>>
>> Patch "DDF: ddf_open_new: check device status for new subarray"
>> added some checks for in the monitor for that situation - in particular,
>> to handle a freshly created array with faulty disks. The remaining
>> problem is that the kernel may start syncing the disks before this
>> situation is detected. This patch delays recovery until mdmon finished
>> checking.
>>
>> tests/10ddf-fail-create-race should succeed reliably with this patch
>> and "DDF: ddf_open_new: check device status for new subarray". Without,
>> it will fail sporadically.
>>
>> Signed-off-by: Martin Wilck <mwilck@arcor.de>
>> ---
>>  Create.c    |    8 ++++++++
>>  managemon.c |    6 ++++++
>>  2 files changed, 14 insertions(+), 0 deletions(-)
>>
>> diff --git a/Create.c b/Create.c
>> index ac22f77..f9b7db2 100644
>> --- a/Create.c
>> +++ b/Create.c
>> @@ -993,6 +993,14 @@ int Create(struct supertype *st, char *mddev,
>>  				need_mdmon = 0;
>>  				break;
>>  			default:
>> +				/*
>> +				 * The meta data we saw on disk may not be
>> +				 * up-to-date. The monitor will check and
>> +				 * possibly fail. Avoid a resync happening
>> +				 * in the kernel before that.
>> +				 */
>> +				sysfs_set_str(&info, NULL, "sync_action",
>> +					      "frozen");
>>  				err = sysfs_set_str(&info, NULL, "array_state",
>>  						    "readonly");
>>  				break;
>> diff --git a/managemon.c b/managemon.c
>> index f40bbdb..5bc54da 100644
>> --- a/managemon.c
>> +++ b/managemon.c
>> @@ -744,6 +744,12 @@ static void manage_new(struct mdstat_ent *mdstat,
>>  		new->container = NULL;
>>  		free_aa(new);
>>  	} else {
>> +		/*
>> +		 * Create() set this to frozen.
>> +		 * This relies on the kernel clear FROZEN status
>> +		 * if an invalid value is written to sync_action.
>> +		 */
>> +		sysfs_set_str(&new->info, NULL, "sync_action", "");
>>  		replace_array(container, victim, new);
>>  		if (failed) {
>>  			new->check_degraded = 1;
> 
> 
> I don't think that this patch should be necessary.
> If you find it makes a difference, then I'm missing something.
> 
> Recovery never starts while the array is readonly - the kernel doesn't allow
> it.
> So mdmon should be in complete control as it is the only one that is allowed
> to change the array from readonly.
> 
> Whether it quite does the right thing in this case I can't promise, but it
> should be able to do the right thing without any extra help from mdadm.

I had a look at the dmesg output and it appeared to me that the kernel
started a resync on the new array before it noticed that one of the disks
was faulty. But I may have been mistaken. Unfortunately I didn't keep
those logs. For now I am fine with what you are saying - you know the
ways of the kernel MD driver far better than I do.

> And writing an empty string to 'sync_action' will not do anything useful.
> You probably mean to write "idle".

Hmm...

4223 static ssize_t
4224 action_store(struct mddev *mddev, const char *page, size_t len)
4225 {
...
4229     if (cmd_match(page, "frozen"))
4230         set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
4231     else
4232         clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);

To me this looks as if FROZEN is cleared whenever anything but
"frozen" is written. What am I missing here? I actually meant to do
exactly that, because I didn't want to lose the previous status. But as
I said, I am not familiar enough with the kernel md code to understand
exactly what happens.

> I've applied and push out the previous patches but not this one.

Thanks.
Martin



-- 
Dr. Martin Wilck
PRIMERGY System Software Engineer
x86 Server Engineering

FUJITSU
Fujitsu Technology Solutions GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn, Germany
Phone:			++49 5251 525 2796
Fax:			++49 5251 525 2820
Email:			martin.wilck@ts.fujitsu.com
Internet:		http://ts.fujitsu.com
Company Details:	http://ts.fujitsu.com/imprint


* Re: Bugreport ddf rebuild problems
  2013-08-05  6:21 ` NeilBrown
@ 2013-08-05  7:17   ` Albert Pauw
  0 siblings, 0 replies; 22+ messages in thread
From: Albert Pauw @ 2013-08-05  7:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

I will check the latest git version tonight and let you know.

Thanks to you and Martin Wilck,

Albert

On 5 August 2013 08:21, NeilBrown <neilb@suse.de> wrote:
> On Thu, 1 Aug 2013 21:22:44 +0200 Albert Pauw <albert.pauw@gmail.com> wrote:
>
>> Now I've got it working, I checked a simple rebuild test with a ddf
>> container, containing two raidsets (raid 1 and raid 5).
>>
>> Having 6 loopback devices of 512 MB each I opened in a separate screen this
>> to monitor what happens:
>>
>> while true; do clear; cat /proc/mdstat ; sleep 1; done
>>
>
> For future reference:
>    watch cat /proc/mdstat
>
> works very nicely for this sort of task.
>
> I believe the bug you described is fixed in the current git tree.  If you
> could confirm I would appreciate it.
>
> Thanks,
> NeilBrown


* Re: Bugreport ddf rebuild problems
       [not found] <CAGkViCHPvbmcehFvACBKVFFCw+DdnjqvK2uNGmvKrFki+n9n-Q@mail.gmail.com>
@ 2013-08-05  6:21 ` NeilBrown
  2013-08-05  7:17   ` Albert Pauw
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2013-08-05  6:21 UTC (permalink / raw)
  To: Albert Pauw; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 620 bytes --]

On Thu, 1 Aug 2013 21:22:44 +0200 Albert Pauw <albert.pauw@gmail.com> wrote:

> Now I've got it working, I checked a simple rebuild test with a ddf
> container, containing two raidsets (raid 1 and raid 5).
> 
> Having 6 loopback devices of 512 MB each I opened in a separate screen this
> to monitor what happens:
> 
> while true; do clear; cat /proc/mdstat ; sleep 1; done
> 

For future reference:
   watch cat /proc/mdstat

works very nicely for this sort of task.

I believe the bug you described is fixed in the current git tree.  If you
could confirm I would appreciate it.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

