* tests/03r5assemV1 issues
@ 2012-07-02 13:24 Jes Sorensen
  2012-07-03  1:44 ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Jes Sorensen @ 2012-07-02 13:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

I am trying to get the test suite stable on RHEL, but I see a lot of
failures in 03r5assemV1, in particular between these two cases:

mdadm -A $md1 -u $uuid $devlist
check state U_U
eval $tst

mdadm -A $md1 --name=one $devlist
check state U_U
check spares 1
eval $tst

I have tested it with the latest upstream kernel as well and see the
same problems. I suspect it is simply that the box is too fast, ending
up with the raid check completing in between the two test cases?

Are you seeing the same thing there? I tried playing with the max speed
variable but it doesn't really seem to make any difference.

Any ideas for what can be done to make this case more resilient to
false positives? I guess one option would be to re-create the array
in between each test?

Cheers,
Jes
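
For context, the "check state U_U" lines above assert on the per-device
status string that /proc/mdstat reports for the array. A minimal sketch of
that kind of check (the helper name and parsing below are hypothetical, not
the harness's own code):

# Hypothetical helper: pull the "[U_U]"-style status string for one md
# device out of /proc/mdstat and compare it with what the test expects.
check_state() {
    md=$1      # e.g. md1
    want=$2    # e.g. U_U for a 3-disk RAID5 with the middle slot missing
    got=$(grep -A1 "^$md :" /proc/mdstat | grep -o '\[[U_]*\]' | tr -d '[]')
    if [ "$got" != "$want" ]; then
        echo "$md: state is '$got', expected '$want'" >&2
        return 1
    fi
}

check_state md1 U_U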


* Re: tests/03r5assemV1 issues
  2012-07-02 13:24 tests/03r5assemV1 issues Jes Sorensen
@ 2012-07-03  1:44 ` NeilBrown
  2012-07-03 16:07   ` Jes Sorensen
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-07-03  1:44 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid


On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
wrote:

> Hi Neil,
> 
> I am trying to get the test suite stable on RHEL, but I see a lot of
> failures in 03r5assemV1, in particular between these two cases:
> 
> mdadm -A $md1 -u $uuid $devlist
> check state U_U
> eval $tst
> 
> mdadm -A $md1 --name=one $devlist
> check state U_U
> check spares 1
> eval $tst
> 
> I have tested it with the latest upstream kernel as well and see the
> same problems. I suspect it is simply that the box is too fast, ending
> up with the raid check completing in between the two test cases?
> 
> Are you seeing the same thing there? I tried playing with the max speed
> variable but it doesn't really seem to make any difference.
> 
> Any ideas for what can be done to make this case more resilient to
> false positives? I guess one option would be to re-create the array
> in between each test?
> 
> Cheers,
> Jes

Maybe it really is a bug?
The test harness sets the resync speed to be very slow.  A fast box will get
through the test more quickly and be more likely to see the array still
syncing.

I'll try to make time to look more closely.
But I wouldn't discount the possibility that the second "mdadm -A" is
short-circuiting the recovery somehow.

thanks,
NeilBrown
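
For reference, the throttling mentioned above is controlled by the global md
sysctls; the values below are illustrative, not the harness's actual
settings:

# md resync/recovery throttle, in KB/s per device.  speed_limit_min is the
# rate md tries to maintain even while the array is busy; speed_limit_max
# caps the rate when the array is otherwise idle.
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

# Illustrative values only: slow recovery right down for a test window.
echo 100  > /proc/sys/dev/raid/speed_limit_min
echo 1000 > /proc/sys/dev/raid/speed_limit_max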




* Re: tests/03r5assemV1 issues
  2012-07-03  1:44 ` NeilBrown
@ 2012-07-03 16:07   ` Jes Sorensen
  2012-07-04  5:23     ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Jes Sorensen @ 2012-07-03 16:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> wrote:
>
>> Hi Neil,
>> 
>> I am trying to get the test suite stable on RHEL, but I see a lot of
>> failures in 03r5assemV1, in particular between these two cases:
>> 
>> mdadm -A $md1 -u $uuid $devlist
>> check state U_U
>> eval $tst
>> 
>> mdadm -A $md1 --name=one $devlist
>> check state U_U
>> check spares 1
>> eval $tst
>> 
>> I have tested it with the latest upstream kernel as well and see the
>> same problems. I suspect it is simply that the box is too fast, ending
>> up with the raid check completing in between the two test cases?
>> 
>> Are you seeing the same thing there? I tried playing with the max speed
>> variable but it doesn't really seem to make any difference.
>> 
>> Any ideas for what can be done to make this case more resilient to
>> false positives? I guess one option would be to re-create the array
>> in between each test?
>
> Maybe it really is a bug?
> The test harness sets the resync speed to be very slow.  A fast box will get
> through the test more quickly and be more likely to see the array still
> syncing.
>
> I'll try to make time to look more closely.
> But I wouldn't discount the possibility that the second "mdadm -A" is
> short-circuiting the recovery somehow.

That could certainly explain what I am seeing. I noticed it doesn't
happen every single time in the same place (from memory), but it is
mostly in that spot in my case.

Even if I trimmed the max speed down to 50 it still happens.

Cheers,
Jes
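
One way to confirm whether the throttle is actually taking effect is to
watch the recovery line in /proc/mdstat while the test runs; the speed=
field shows the rate md is really achieving. Illustrative only, assuming
the test array is md1:

# Watch the md1 block of /proc/mdstat once a second; during recovery it
# includes a line like "[=>....]  recovery = ... speed=1000K/sec".
watch -n1 'grep -A3 "^md1 :" /proc/mdstat'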


* Re: tests/03r5assemV1 issues
  2012-07-03 16:07   ` Jes Sorensen
@ 2012-07-04  5:23     ` NeilBrown
  2012-07-06  9:59       ` Jes Sorensen
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2012-07-04  5:23 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid


On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
wrote:

> NeilBrown <neilb@suse.de> writes:
> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> > wrote:
> >
> >> Hi Neil,
> >> 
> >> I am trying to get the test suite stable on RHEL, but I see a lot of
> >> failures in 03r5assemV1, in particular between these two cases:
> >> 
> >> mdadm -A $md1 -u $uuid $devlist
> >> check state U_U
> >> eval $tst
> >> 
> >> mdadm -A $md1 --name=one $devlist
> >> check state U_U
> >> check spares 1
> >> eval $tst
> >> 
> >> I have tested it with the latest upstream kernel as well and see the
> >> same problems. I suspect it is simply that the box is too fast, ending
> >> up with the raid check completing in between the two test cases?
> >> 
> >> Are you seeing the same thing there? I tried playing with the max speed
> >> variable but it doesn't really seem to make any difference.
> >> 
> >> Any ideas for what can be done to make this case more resilient to
> >> false positives? I guess one option would be to re-create the array
> >> in between each test?
> >
> > Maybe it really is a bug?
> > The test harness sets the resync speed to be very slow.  A fast box will get
> > through the test more quickly and be more likely to see the array still
> > syncing.
> >
> > I'll try to make time to look more closely.
> > But I wouldn't discount the possibility that the second "mdadm -A" is
> > short-circuiting the recovery somehow.
> 
> That could certainly explain what I am seeing. I noticed it doesn't
> happen every single time in the same place (from memory), but it is
> mostly in that spot in my case.
> 
> Even if I trimmed the max speed down to 50 it still happens.

I cannot easily reproduce this.
Exactly which kernel and which mdadm do you find it with - just to make sure
I'm testing the same thing as you?

Thanks,
NeilBrown




* Re: tests/03r5assemV1 issues
  2012-07-04  5:23     ` NeilBrown
@ 2012-07-06  9:59       ` Jes Sorensen
  2012-07-11  4:20         ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Jes Sorensen @ 2012-07-06  9:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> wrote:
>
>> NeilBrown <neilb@suse.de> writes:
>> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
>> > wrote:
>> >
>> >> Hi Neil,
>> >> 
>> >> I am trying to get the test suite stable on RHEL, but I see a lot of
>> >> failures in 03r5assemV1, in particular between these two cases:
>> >> 
>> >> mdadm -A $md1 -u $uuid $devlist
>> >> check state U_U
>> >> eval $tst
>> >> 
>> >> mdadm -A $md1 --name=one $devlist
>> >> check state U_U
>> >> check spares 1
>> >> eval $tst
>> >> 
>> >> I have tested it with the latest upstream kernel as well and see the
>> >> same problems. I suspect it is simply that the box is too fast, ending
>> >> up with the raid check completing in between the two test cases?
>> >> 
>> >> Are you seeing the same thing there? I tried playing with the max speed
>> >> variable but it doesn't really seem to make any difference.
>> >> 
>> >> Any ideas for what can be done to make this case more resilient to
>> >> false positives? I guess one option would be to re-create the array
>> >> in between each test?
>> >
>> > Maybe it really is a bug?
>> > The test harness sets the resync speed to be very slow.  A fast box will get
>> > through the test more quickly and be more likely to see the array still
>> > syncing.
>> >
>> > I'll try to make time to look more closely.
>> > But I wouldn't discount the possibility that the second "mdadm -A" is
>> > short-circuiting the recovery somehow.
>> 
>> That could certainly explain what I am seeing. I noticed it doesn't
>> happen every single time in the same place (from memory), but it is
>> mostly in that spot in my case.
>> 
>> Even if I trimmed the max speed down to 50 it still happens.
>
> I cannot easily reproduce this.
> Exactly which kernel and which mdadm do you find it with - just to make sure
> I'm testing the same thing as you?

Hi Neil,

Odd - I see it with
mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
kernel: ca24a145573124732152daff105ba68cc9a2b545

I've seen this happen for a while fwiw.

Note the box has a number of external drives with a number of my scratch
raid arrays on it. It shouldn't affect this, but just in case.

The system-installed mdadm is a 3.2.3 derivative, but I checked running
with PATH=. as well.

Cheers,
Jes


* Re: tests/03r5assemV1 issues
  2012-07-06  9:59       ` Jes Sorensen
@ 2012-07-11  4:20         ` NeilBrown
  2012-07-11  4:28           ` Roman Mamedov
  2012-07-11  7:18           ` Jes Sorensen
  0 siblings, 2 replies; 8+ messages in thread
From: NeilBrown @ 2012-07-11  4:20 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid


On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
wrote:

> NeilBrown <neilb@suse.de> writes:
> > On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> > wrote:
> >
> >> NeilBrown <neilb@suse.de> writes:
> >> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >> 
> >> >> I am trying to get the test suite stable on RHEL, but I see a lot of
> >> >> failures in 03r5assemV1, in particular between these two cases:
> >> >> 
> >> >> mdadm -A $md1 -u $uuid $devlist
> >> >> check state U_U
> >> >> eval $tst
> >> >> 
> >> >> mdadm -A $md1 --name=one $devlist
> >> >> check state U_U
> >> >> check spares 1
> >> >> eval $tst
> >> >> 
> >> >> I have tested it with the latest upstream kernel as well and see the
> >> >> same problems. I suspect it is simply that the box is too fast, ending
> >> >> up with the raid check completing in between the two test cases?
> >> >> 
> >> >> Are you seeing the same thing there? I tried playing with the max speed
> >> >> variable but it doesn't really seem to make any difference.
> >> >> 
> >> >> Any ideas for what can be done to make this case more resilient to
> >> >> false positives? I guess one option would be to re-create the array
> >> >> in between each test?
> >> >
> >> > Maybe it really is a bug?
> >> > The test harness sets the resync speed to be very slow.  A fast box will get
> >> > through the test more quickly and be more likely to see the array still
> >> > syncing.
> >> >
> >> > I'll try to make time to look more closely.
> >> > But I wouldn't discount the possibility that the second "mdadm -A" is
> >> > short-circuiting the recovery somehow.
> >> 
> >> That could certainly explain what I am seeing. I noticed it doesn't
> >> happen every single time in the same place (from memory), but it is
> >> mostly in that spot in my case.
> >> 
> >> Even if I trimmed the max speed down to 50 it still happens.
> >
> > I cannot easily reproduce this.
> > Exactly which kernel and which mdadm do you find it with - just to make sure
> > I'm testing the same thing as you?
> 
> Hi Neil,
> 
> Odd - I see it with
> mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
> kernel: ca24a145573124732152daff105ba68cc9a2b545
> 
> I've seen this happen for a while fwiw.
> 
> Note the box has a number of external drives with a number of my scratch
> raid arrays on it. It shouldn't affect this, but just in case.
> 
> The system-installed mdadm is a 3.2.3 derivative, but I checked running
> with PATH=. as well.

Thanks.
I think I figured out what is happening.

It seems that setting the max_speed down to 1000 is often enough, but not
always.  So we need to set it lower.
But setting max_speed lower is not effective unless you also set min_speed
lower.  This is the tricky bit that took me way too long to realise.

So with this patch, it is quite reliable.

NeilBrown

diff --git a/tests/03r5assemV1 b/tests/03r5assemV1
index 52b1107..bca0c58 100644
--- a/tests/03r5assemV1
+++ b/tests/03r5assemV1
@@ -60,7 +60,8 @@ eval $tst
 ### Now with a missing device
 # We don't want the recovery to complete while we are
 # messing about here.
-echo 1000 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_min
 
 mdadm -AR $md1 $dev0 $dev2 $dev3 $dev4 #
 check state U_U
@@ -124,3 +125,4 @@ mdadm -I -c $conf $dev1
 mdadm -I -c $conf $dev2
 eval $tst
 echo 2000 > /proc/sys/dev/raid/speed_limit_max
+echo 1000 > /proc/sys/dev/raid/speed_limit_min



* Re: tests/03r5assemV1 issues
  2012-07-11  4:20         ` NeilBrown
@ 2012-07-11  4:28           ` Roman Mamedov
  2012-07-11  7:18           ` Jes Sorensen
  1 sibling, 0 replies; 8+ messages in thread
From: Roman Mamedov @ 2012-07-11  4:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jes Sorensen, linux-raid


On Wed, 11 Jul 2012 14:20:53 +1000
NeilBrown <neilb@suse.de> wrote:

> On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> wrote:
> 
> > NeilBrown <neilb@suse.de> writes:
> > > On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> > > wrote:
> > >
> > >> NeilBrown <neilb@suse.de> writes:
> > >> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> > >> > wrote:
> > >> >
> > >> >> Hi Neil,
> > >> >> 
> > >> >> I am trying to get the test suite stable on RHEL, but I see a lot of
> > >> >> failures in 03r5assemV1, in particular between these two cases:
> > >> >> 
> > >> >> mdadm -A $md1 -u $uuid $devlist
> > >> >> check state U_U
> > >> >> eval $tst
> > >> >> 
> > >> >> mdadm -A $md1 --name=one $devlist
> > >> >> check state U_U
> > >> >> check spares 1
> > >> >> eval $tst
> > >> >> 
> > >> >> I have tested it with the latest upstream kernel as well and see the
> > >> >> same problems. I suspect it is simply that the box is too fast, ending
> > >> >> up with the raid check completing in between the two test cases?
> > >> >> 
> > >> >> Are you seeing the same thing there? I tried playing with the max speed
> > >> >> variable but it doesn't really seem to make any difference.
> > >> >> 
> > >> >> Any ideas for what can be done to make this case more resilient to
> > >> >> false positives? I guess one option would be to re-create the array
> > >> >> in between each test?
> > >> >
> > >> > Maybe it really is a bug?
> > >> > The test harness sets the resync speed to be very slow.  A fast box will get
> > >> > through the test more quickly and be more likely to see the array still
> > >> > syncing.
> > >> >
> > >> > I'll try to make time to look more closely.
> > >> > But I wouldn't discount the possibility that the second "mdadm -A" is
> > >> > short-circuiting the recovery somehow.
> > >> 
> > >> That could certainly explain what I am seeing. I noticed it doesn't
> > >> happen every single time in the same place (from memory), but it is
> > >> mostly in that spot in my case.
> > >> 
> > >> Even if I trimmed the max speed down to 50 it still happens.
> > >
> > > I cannot easily reproduce this.
> > > Exactly which kernel and which mdadm do you find it with - just to make sure
> > > I'm testing the same thing as you?
> > 
> > Hi Neil,
> > 
> > Odd - I see it with
> > mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
> > kernel: ca24a145573124732152daff105ba68cc9a2b545
> > 
> > I've seen this happen for a while fwiw.
> > 
> > Note the box has a number of external drives with a number of my scratch
> > raid arrays on it. It shouldn't affect this, but just in case.
> > 
> > The system-installed mdadm is a 3.2.3 derivative, but I checked running
> > with PATH=. as well.
> 
> Thanks.
> I think I figured out what is happening.
> 
> It seems that setting the max_speed down to 1000 is often enough, but not
> always.  So we need to set it lower.
> But setting max_speed lower is not effective unless you also set min_speed
> lower.  This is the tricky bit that took me way too long to realise.
> 
> So with this patch, it is quite reliable.
> 
> NeilBrown
> 
> diff --git a/tests/03r5assemV1 b/tests/03r5assemV1
> index 52b1107..bca0c58 100644
> --- a/tests/03r5assemV1
> +++ b/tests/03r5assemV1
> @@ -60,7 +60,8 @@ eval $tst
>  ### Now with a missing device
>  # We don't want the recovery to complete while we are
>  # messing about here.
> -echo 1000 > /proc/sys/dev/raid/speed_limit_max
> +echo 100 > /proc/sys/dev/raid/speed_limit_max
> +echo 100 > /proc/sys/dev/raid/speed_limit_min

Purely from an armchair perspective, don't you need to reduce 'min' first, and
only then lower 'max'? As it is now, depending on the kernel side, the first
"echo" has every right to fail with "Invalid argument" (or something similar)
if there were a check that max cannot be lower than min.

>  
>  mdadm -AR $md1 $dev0 $dev2 $dev3 $dev4 #
>  check state U_U
> @@ -124,3 +125,4 @@ mdadm -I -c $conf $dev1
>  mdadm -I -c $conf $dev2
>  eval $tst
>  echo 2000 > /proc/sys/dev/raid/speed_limit_max
> +echo 1000 > /proc/sys/dev/raid/speed_limit_min


-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."
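
Regarding the ordering concern above: the raid sysctls do not appear to
cross-check the two values today, but if such a check existed, an
order-safe variant would adjust the bound that cannot conflict first. A
sketch, reusing the illustrative values from the patch (the helper names
are hypothetical):

# Assuming a kernel that rejected max < min, always move the limit that
# cannot violate the constraint first.
throttle_resync() {
    echo 100 > /proc/sys/dev/raid/speed_limit_min     # lower the floor first
    echo 100 > /proc/sys/dev/raid/speed_limit_max     # then the ceiling
}

restore_resync() {
    echo 2000 > /proc/sys/dev/raid/speed_limit_max    # raise the ceiling first
    echo 1000 > /proc/sys/dev/raid/speed_limit_min    # then the floor
}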



* Re: tests/03r5assemV1 issues
  2012-07-11  4:20         ` NeilBrown
  2012-07-11  4:28           ` Roman Mamedov
@ 2012-07-11  7:18           ` Jes Sorensen
  1 sibling, 0 replies; 8+ messages in thread
From: Jes Sorensen @ 2012-07-11  7:18 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@redhat.com>
> wrote:
>> Hi Neil,
>> 
>> Odd - I see it with
>> mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
>> kernel: ca24a145573124732152daff105ba68cc9a2b545
>> 
>> I've seen this happen for a while fwiw.
>> 
>> Note the box has a number of external drives with a number of my scratch
>> raid arrays on it. It shouldn't affect this, but just in case.
>> 
>> The system-installed mdadm is a 3.2.3 derivative, but I checked running
>> with PATH=. as well.
>
> Thanks.
> I think I figured out what is happening.
>
> It seems that setting the max_speed down to 1000 is often enough, but not
> always.  So we need to set it lower.
> But setting max_speed lower is not effective unless you also set min_speed
> lower.  This is the tricky bit that took me way too long to realise.
>
> So with this patch, it is quite reliable.

Hi Neil,

Just tried it out here, and it does indeed solve the problem for
me. Makes sense in the end :)

Looks like we need the same fix in tests/07reshape5intr

Thanks for figuring this out.

Cheers,
Jes


