* tests/03r5assemV1 issues

From: Jes Sorensen <Jes.Sorensen@redhat.com>
Date: 2012-07-02 13:24 UTC
To: NeilBrown
Cc: linux-raid

Hi Neil,

I am trying to get the test suite stable on RHEL, but I see a lot of
failures in 03r5assemV1, in particular between these two cases:

mdadm -A $md1 -u $uuid $devlist
check state U_U
eval $tst

mdadm -A $md1 --name=one $devlist
check state U_U
check spares 1
eval $tst

I have tested it with the latest upstream kernel as well and see the
same problems. I suspect it is simply that the box is too fast, so the
raid check completes in between the two test cases?

Are you seeing the same thing there? I tried playing with the max speed
variable but it doesn't really seem to make any difference.

Any ideas for what can be done to make this case more resilient to
false positives? I guess one option would be to re-create the array
in between each test?

Cheers,
Jes
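[Editorial note: one hedged sketch of the "make it resilient" idea raised above is to wait for any in-flight resync to finish before the next state check, so the result is deterministic regardless of box speed. On a real array, `mdadm --wait /dev/md1` does this; the loop below is a manual equivalent against md's `sync_completed` sysfs file, demonstrated with a mock file so it runs without root or an md array. `wait_for_sync` is a hypothetical helper, not part of the mdadm test suite.]

```shell
# Hedged sketch: block until any in-flight resync/recovery has finished,
# so a subsequent state check cannot race against the sync completing.
wait_for_sync() {
    # $1: path to the array's sync_completed file
    #     (normally /sys/block/mdX/md/sync_completed; reads "none" when idle)
    while [ "$(cat "$1")" != "none" ]; do
        sleep 0.1
    done
}

# Demonstration against a mock file (no root or md array needed):
mock=$(mktemp)
echo none > "$mock"
wait_for_sync "$mock" && echo "array idle"
```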
* Re: tests/03r5assemV1 issues

From: NeilBrown <neilb@suse.de>
Date: 2012-07-03  1:44 UTC
To: Jes Sorensen
Cc: linux-raid

On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@redhat.com> wrote:

> I am trying to get the test suite stable on RHEL, but I see a lot of
> failures in 03r5assemV1, in particular between these two cases:
>
> mdadm -A $md1 -u $uuid $devlist
> check state U_U
> eval $tst
>
> mdadm -A $md1 --name=one $devlist
> check state U_U
> check spares 1
> eval $tst
>
> [...]
>
> Any ideas for what can be done to make this case more resilient to
> false positives? I guess one option would be to re-create the array
> in between each test?

Maybe it really is a bug?

The test harness sets the resync speed to be very slow. A fast box will
get through the test more quickly and be more likely to see the array
still syncing.

I'll try to make time to look more closely. But I wouldn't discount the
possibility that the second "mdadm -A" is short-circuiting the recovery
somehow.

thanks,
NeilBrown
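[Editorial note: the race Neil describes — the state check running while the array may or may not still be syncing — can be made concrete. The sketch below is an assumed helper, not the test suite's actual `check` implementation (which may differ): an md array that is still resyncing shows a "resync = N%" or "recovery = N%" progress line in /proc/mdstat, so grepping for that distinguishes the two cases. Demonstrated with a mock mdstat file so it runs anywhere.]

```shell
# Assumed helper (not the mdadm test suite's `check` function): detect
# whether an array is still syncing from mdstat-format output.
is_syncing() {
    # $1: path to an mdstat-format file (normally /proc/mdstat)
    grep -Eq '(resync|recovery) *=' "$1"
}

# Demonstration against a mock mdstat (no md array needed):
mock=$(mktemp)
cat > "$mock" <<'EOF'
md1 : active raid5 loop3[3] loop2[2] loop0[0]
      2048 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [==>.................]  recovery = 12.5% (256/2048) finish=0.1min
EOF
is_syncing "$mock" && echo "still syncing"
```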
* Re: tests/03r5assemV1 issues

From: Jes Sorensen <Jes.Sorensen@redhat.com>
Date: 2012-07-03 16:07 UTC
To: NeilBrown
Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> Maybe it really is a bug?
>
> The test harness sets the resync speed to be very slow. A fast box will
> get through the test more quickly and be more likely to see the array
> still syncing.
>
> I'll try to make time to look more closely. But I wouldn't discount the
> possibility that the second "mdadm -A" is short-circuiting the recovery
> somehow.

That could certainly explain what I am seeing. I noticed it doesn't
happen every single time in the same place (from memory), but it is
mostly in that spot in my case.

Even if I trimmed the max speed down to 50 it still happens.

Cheers,
Jes
* Re: tests/03r5assemV1 issues

From: NeilBrown <neilb@suse.de>
Date: 2012-07-04  5:23 UTC
To: Jes Sorensen
Cc: linux-raid

On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@redhat.com> wrote:

> That could certainly explain what I am seeing. I noticed it doesn't
> happen every single time in the same place (from memory), but it is
> mostly in that spot in my case.
>
> Even if I trimmed the max speed down to 50 it still happens.

I cannot easily reproduce this. Exactly which kernel and which mdadm do
you find it with - just to make sure I'm testing the same thing as you?

Thanks,
NeilBrown
* Re: tests/03r5assemV1 issues

From: Jes Sorensen <Jes.Sorensen@redhat.com>
Date: 2012-07-06  9:59 UTC
To: NeilBrown
Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> I cannot easily reproduce this. Exactly which kernel and which mdadm do
> you find it with - just to make sure I'm testing the same thing as you?

Hi Neil,

Odd - I see it with
mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
kernel: ca24a145573124732152daff105ba68cc9a2b545

I've seen this happen for a while fwiw.

Note the box has a number of external drives with a number of my scratch
raid arrays on it. It shouldn't affect this, but just in case.

The system installed mdadm is a 3.2.3 derivative, but I checked running
with PATH=. as well.

Cheers,
Jes
* Re: tests/03r5assemV1 issues

From: NeilBrown <neilb@suse.de>
Date: 2012-07-11  4:20 UTC
To: Jes Sorensen
Cc: linux-raid

On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@redhat.com> wrote:

> Odd - I see it with
> mdadm:  721b662b5b33830090c220bbb04bf1904d4b7eed
> kernel: ca24a145573124732152daff105ba68cc9a2b545
>
> I've seen this happen for a while fwiw.
>
> Note the box has a number of external drives with a number of my scratch
> raid arrays on it. It shouldn't affect this, but just in case.
>
> The system installed mdadm is a 3.2.3 derivative, but I checked running
> with PATH=. as well.

Thanks. I think I figured out what is happening.

It seems that setting max_speed down to 1000 is often enough, but not
always, so we need to set it lower. But setting max_speed lower is not
effective unless you also set min_speed lower. This is the tricky bit
that took me way too long to realise.

So with this patch, it is quite reliable.

NeilBrown

diff --git a/tests/03r5assemV1 b/tests/03r5assemV1
index 52b1107..bca0c58 100644
--- a/tests/03r5assemV1
+++ b/tests/03r5assemV1
@@ -60,7 +60,8 @@ eval $tst
 ### Now with a missing device
 # We don't want the recovery to complete while we are
 # messing about here.
-echo 1000 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_min
 
 mdadm -AR $md1 $dev0 $dev2 $dev3 $dev4 #
 check state U_U
@@ -124,3 +125,4 @@ mdadm -I -c $conf $dev1
 mdadm -I -c $conf $dev2
 eval $tst
 echo 2000 > /proc/sys/dev/raid/speed_limit_max
+echo 1000 > /proc/sys/dev/raid/speed_limit_min
* Re: tests/03r5assemV1 issues

From: Roman Mamedov
Date: 2012-07-11  4:28 UTC
To: NeilBrown
Cc: Jes Sorensen, linux-raid

On Wed, 11 Jul 2012 14:20:53 +1000 NeilBrown <neilb@suse.de> wrote:

> It seems that setting max_speed down to 1000 is often enough, but not
> always, so we need to set it lower. But setting max_speed lower is not
> effective unless you also set min_speed lower. This is the tricky bit
> that took me way too long to realise.
>
> So with this patch, it is quite reliable.
>
> [...]
> -echo 1000 > /proc/sys/dev/raid/speed_limit_max
> +echo 100 > /proc/sys/dev/raid/speed_limit_max
> +echo 100 > /proc/sys/dev/raid/speed_limit_min

Purely from an armchair perspective, don't you need to reduce 'min'
first, and only then lower 'max'? As it is currently, depending on the
kernel side the first "echo" has every right to fail with "Invalid
argument" (or something similar), if there were a check that max cannot
be lower than min.

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."
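[Editorial note: Roman's ordering concern can be folded into a small order-safe helper. This is a sketch, not code from the mdadm test suite: `set_speed_limits` is a hypothetical function, and current kernels do not reject max < min, but writing the knobs in this order would stay safe even if such a check were added. The demonstration points the knobs at mock files so it runs without root.]

```shell
# Hypothetical order-safe helper for changing both md speed limits.
# If the kernel ever enforced max >= min, lowering max below the
# current min first could fail; so lower min first when going down,
# and raise max first when going up.
RAID_DIR=${RAID_DIR:-/proc/sys/dev/raid}

set_speed_limits() {
    # $1: new min, $2: new max (KB/sec per device)
    local min=$1 max=$2 cur_min
    cur_min=$(cat "$RAID_DIR/speed_limit_min")
    if [ "$min" -lt "$cur_min" ]; then
        echo "$min" > "$RAID_DIR/speed_limit_min"   # going down: min first
        echo "$max" > "$RAID_DIR/speed_limit_max"
    else
        echo "$max" > "$RAID_DIR/speed_limit_max"   # going up: max first
        echo "$min" > "$RAID_DIR/speed_limit_min"
    fi
}

# Demonstration with mock files (no root needed):
RAID_DIR=$(mktemp -d)
echo 1000 > "$RAID_DIR/speed_limit_min"
echo 2000 > "$RAID_DIR/speed_limit_max"
set_speed_limits 100 100     # slow resync right down for the test
set_speed_limits 1000 2000   # restore the defaults afterwards
```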
* Re: tests/03r5assemV1 issues

From: Jes Sorensen <Jes.Sorensen@redhat.com>
Date: 2012-07-11  7:18 UTC
To: NeilBrown
Cc: linux-raid

NeilBrown <neilb@suse.de> writes:
> It seems that setting max_speed down to 1000 is often enough, but not
> always, so we need to set it lower. But setting max_speed lower is not
> effective unless you also set min_speed lower. This is the tricky bit
> that took me way too long to realise.
>
> So with this patch, it is quite reliable.

Hi Neil,

Just tried it out here, and it does indeed solve the problem for me.
Makes sense in the end :)

Looks like we need the same fix in tests/07reshape5intr

Thanks for figuring this out.

Cheers,
Jes
Thread overview: 8+ messages

2012-07-02 13:24 tests/03r5assemV1 issues - Jes Sorensen
2012-07-03  1:44 ` NeilBrown
2012-07-03 16:07   ` Jes Sorensen
2012-07-04  5:23     ` NeilBrown
2012-07-06  9:59       ` Jes Sorensen
2012-07-11  4:20         ` NeilBrown
2012-07-11  4:28           ` Roman Mamedov
2012-07-11  7:18           ` Jes Sorensen