* raid1_end_read_request does not retry failed READ from a recovering drive
@ 2014-09-07 14:18 Alexander Lyakas
  2014-09-08  7:17 ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lyakas @ 2014-09-07 14:18 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Hi Neil,
we see the following issue:

# RAID1 has 2 drives A and B, drive B is recovering
# READ request arrives
# read_balance selects drive B to read from, because the READ sector
comes before B->recovery_offset
# READ is issued to drive B, but fails (drive B fails again)

Now raid1_end_read_request() has the following code:

    if (uptodate)
        set_bit(R1BIO_Uptodate, &r1_bio->state);
    else {
        /* If all other devices have failed, we want to return
         * the error upwards rather than fail the last device.
         * Here we redefine "uptodate" to mean "Don't want to retry"
         */
        unsigned long flags;
        spin_lock_irqsave(&conf->device_lock, flags);
        if (r1_bio->mddev->degraded == conf->raid_disks ||
            (r1_bio->mddev->degraded == conf->raid_disks-1 &&
             !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
            uptodate = 1;
        spin_unlock_irqrestore(&conf->device_lock, flags);
    }

According to this code uptodate wrongly becomes 1, because:
r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
and
!test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE

Indeed, drive B is not marked as Faulty, but it is also not marked as
In_sync. However, this function treats !Faulty as equivalent to In_sync,
so it decides that the last good drive has failed and does not retry
the READ.

As a result, there is an IO error, whereas we should have retried the
READ from the healthy drive.

This is happening in 3.8.13, but your master branch seems to have the
same issue.

What is a reasonable fix?
1) Do not read from drives which are !In_sync (a bit scary to read
from such a drive)
2) Replace the !Faulty check with an In_sync check
3) Do both

Can you please advise?

Thanks,
Alex.


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2014-09-07 14:18 raid1_end_read_request does not retry failed READ from a recovering drive Alexander Lyakas
@ 2014-09-08  7:17 ` NeilBrown
  2014-09-17 17:57   ` Alexander Lyakas
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-09-08  7:17 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> we see the following issue:
> 
> # RAID1 has 2 drives A and B, drive B is recovering
> # READ request arrives
> # read_balanace selects drive B to read from, because READ sector
> comes before B->recovery_offset
> # READ is issued to drive B, but fails (drive B fails again)
> 
> Now raid1_end_read_request() has the following code:
> 
>     if (uptodate)
>         set_bit(R1BIO_Uptodate, &r1_bio->state);
>     else {
>         /* If all other devices have failed, we want to return
>          * the error upwards rather than fail the last device.
>          * Here we redefine "uptodate" to mean "Don't want to retry"
>          */
>         unsigned long flags;
>         spin_lock_irqsave(&conf->device_lock, flags);
>         if (r1_bio->mddev->degraded == conf->raid_disks ||
>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
>             uptodate = 1;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>     }
> 
> According to this code uptodate wrongly becomes 1, because:
> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
> and
> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
> 
> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
> However, this function treats !Faulty being equal to In_Sync, so it
> decides that the last good drive failed, so it does not retry the
> READ.
> 
> As a result, there is IO error, while we should have retried the READ
> from the healthy drive.
> 
> This is happening in 3.8.13, but your master branch seems to have the
> same issue.
> 
> What is a reasonable fix?
> 1) Do not read from drives which are !In_sync (a bit scary to read
> from such drive)

It is perfectly safe to read from a !In_sync device providing you are before
->recovery_offset.
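
For reference, a rough sketch of the guard in read_balance() that
enforces this (abbreviated; the exact code differs a little between
kernel versions):

    /* in read_balance(): only consider a recovering (!In_sync) device
     * if the whole request lies below its ->recovery_offset */
    if (!test_bit(In_sync, &rdev->flags) &&
        rdev->recovery_offset < this_sector + sectors)
        continue;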


> 2) replace !Faulty to In_sync check

That probably makes sense... though that could race with raid1_spare_active().
If a read-error returned just after raid1_spare_active() set In_sync, and
before 'count' was subtracted from ->degraded, we would still set uptodate
when we shouldn't.
It probably makes sense to put all of raid1_spare_active inside the spinlock -
it doesn't get called often enough for performance to be an issue (I hope).

So:
 1/ change !Faulty to In_sync
 2/ extend the spinlock in raid1_spare_active to cover the whole function.
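
For 1/, that would be roughly the following in raid1_end_read_request()
(a sketch only, untested):

    spin_lock_irqsave(&conf->device_lock, flags);
    if (r1_bio->mddev->degraded == conf->raid_disks ||
        (r1_bio->mddev->degraded == conf->raid_disks-1 &&
         /* was !test_bit(Faulty, ...): only treat this as the last
          * working device if it really is In_sync */
         test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
        uptodate = 1;
    spin_unlock_irqrestore(&conf->device_lock, flags);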

Thanks,
NeilBrown



* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2014-09-08  7:17 ` NeilBrown
@ 2014-09-17 17:57   ` Alexander Lyakas
  2014-09-18  1:05     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lyakas @ 2014-09-17 17:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
> On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> we see the following issue:
>>
>> # RAID1 has 2 drives A and B, drive B is recovering
>> # READ request arrives
>> # read_balanace selects drive B to read from, because READ sector
>> comes before B->recovery_offset
>> # READ is issued to drive B, but fails (drive B fails again)
>>
>> Now raid1_end_read_request() has the following code:
>>
>>     if (uptodate)
>>         set_bit(R1BIO_Uptodate, &r1_bio->state);
>>     else {
>>         /* If all other devices have failed, we want to return
>>          * the error upwards rather than fail the last device.
>>          * Here we redefine "uptodate" to mean "Don't want to retry"
>>          */
>>         unsigned long flags;
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         if (r1_bio->mddev->degraded == conf->raid_disks ||
>>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
>>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
>>             uptodate = 1;
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>     }
>>
>> According to this code uptodate wrongly becomes 1, because:
>> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
>> and
>> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
>>
>> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
>> However, this function treats !Faulty being equal to In_Sync, so it
>> decides that the last good drive failed, so it does not retry the
>> READ.
>>
>> As a result, there is IO error, while we should have retried the READ
>> from the healthy drive.
>>
>> This is happening in 3.8.13, but your master branch seems to have the
>> same issue.
>>
>> What is a reasonable fix?
>> 1) Do not read from drives which are !In_sync (a bit scary to read
>> from such drive)
>
> It is perfectly safe to read from a !In_sync device providing you are before
> ->recovery_offset.
>
>
>> 2) replace !Faulty to In_sync check
>
> That probably makes sense... though that could race with raid1_spare_active().
> If a read-error returned just after raid1_spare_active() set In_sync, and
> before 'count' was subtracted from ->degraded, we would still set uptodate
> when we shouldn't.
> It probably make sense to put all of raid1_spare_active inside the spinlock -
> it doesn't get call often enough that performance is an issue (I hope).
>
> So:
>  1/ change !Faulty to In_sync
>  2/ extend the spinlock in raid1_spare_active to cover the whole function.


I made these fixes and reproduced the issue. However, the result is
not what we expect:

# raid1_end_read_request() now indeed adds the r1_bio into retry_list,
as we wanted
# raid1d calls fix_read_error()
# fix_read_error() searches for an In_sync drive to read the data
from. It finds such a drive (this is our good drive A)
# now fix_read_error() wants to rewrite the bad area. But it rewrites
only those drives that are In_sync (except the drive it got the
data from). In our case, it never tries to rewrite the data on drive B
(drive B is marked neither Faulty nor In_sync). As a result,
md_error() is not called, so drive B is still not marked Faulty
when fix_read_error() completes
# so handle_read_error() retries the original READ by calling
read_balance(), which again in my case selects the recovering drive
B...

And then the whole flow repeats itself again and again... and the READ
never completes.

Maybe we should not allow selecting recovering drives for READ? Or
some other approach?
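
To illustrate the first option, I suppose it would amount to something
like this in read_balance(), skipping recovering devices entirely
(a rough sketch, not tested):

    /* hypothetical alternative: never read from a device that is
     * still recovering, even below its ->recovery_offset */
    if (rdev == NULL || test_bit(Faulty, &rdev->flags))
        continue;
    if (!test_bit(In_sync, &rdev->flags))
        continue;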

Thanks,
Alex.




>
> Thanks,
> NeilBrown


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2014-09-17 17:57   ` Alexander Lyakas
@ 2014-09-18  1:05     ` NeilBrown
       [not found]       ` <CAGRgLy7+bN8ztaaN+oh6uATE7=Z=8LnU=R0m2CnVff4ZqgJFKg@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-09-18  1:05 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid


On Wed, 17 Sep 2014 20:57:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> 
> On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
> > On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi Neil,
> >> we see the following issue:
> >>
> >> # RAID1 has 2 drives A and B, drive B is recovering
> >> # READ request arrives
> >> # read_balanace selects drive B to read from, because READ sector
> >> comes before B->recovery_offset
> >> # READ is issued to drive B, but fails (drive B fails again)
> >>
> >> Now raid1_end_read_request() has the following code:
> >>
> >>     if (uptodate)
> >>         set_bit(R1BIO_Uptodate, &r1_bio->state);
> >>     else {
> >>         /* If all other devices have failed, we want to return
> >>          * the error upwards rather than fail the last device.
> >>          * Here we redefine "uptodate" to mean "Don't want to retry"
> >>          */
> >>         unsigned long flags;
> >>         spin_lock_irqsave(&conf->device_lock, flags);
> >>         if (r1_bio->mddev->degraded == conf->raid_disks ||
> >>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
> >>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
> >>             uptodate = 1;
> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
> >>     }
> >>
> >> According to this code uptodate wrongly becomes 1, because:
> >> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
> >> and
> >> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
> >>
> >> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
> >> However, this function treats !Faulty being equal to In_Sync, so it
> >> decides that the last good drive failed, so it does not retry the
> >> READ.
> >>
> >> As a result, there is IO error, while we should have retried the READ
> >> from the healthy drive.
> >>
> >> This is happening in 3.8.13, but your master branch seems to have the
> >> same issue.
> >>
> >> What is a reasonable fix?
> >> 1) Do not read from drives which are !In_sync (a bit scary to read
> >> from such drive)
> >
> > It is perfectly safe to read from a !In_sync device providing you are before
> > ->recovery_offset.
> >
> >
> >> 2) replace !Faulty to In_sync check
> >
> > That probably makes sense... though that could race with raid1_spare_active().
> > If a read-error returned just after raid1_spare_active() set In_sync, and
> > before 'count' was subtracted from ->degraded, we would still set uptodate
> > when we shouldn't.
> > It probably make sense to put all of raid1_spare_active inside the spinlock -
> > it doesn't get call often enough that performance is an issue (I hope).
> >
> > So:
> >  1/ change !Faulty to In_sync
> >  2/ extend the spinlock in raid1_spare_active to cover the whole function.
> 
> 
> I made these fixes and reproduced the issue. However, the result is
> not what we expect:
> 
> # raid1_end_read_request() now indeed adds the r1_bio into retry_list,
> as we wanted
> # raid1d calls fix_read_error()
> # fix_read_error() searches for an In_sync drive to read the data
> from. It finds such drive (this is our good drive A)
> # now fix_read_error() wants to rewrite the bad area. But it rewrites
> only on those drives that are In_sync (except the drive it got the
> data from). In our case, it never tries to rewrite the data on drive B
> (drive B is not marked Faulty and not marked In_sync). As a result,
> md_error() is not called, so drive B is still not marked as Failed
> when fix_read_error() completes
> # so handle_read_error() retries the original READ by calling
> read_balance(), which again in my case selects the recovering drive
> B...
> 
> And then the whole flow repeats itself again and again...and READ
> never completes.
> 
> Maybe we should not allow selecting recovering drives for READ? Or
> some other approach?
> 

Thanks for the testing and analysis.
Presumably we just want fix_read_error() to write to all non-faulty
devices, not just the In_sync ones.
i.e. the following patch.

Thanks,
NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6a9c73435eb8..a95f9e179e6f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2153,7 +2153,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
 			d--;
 			rdev = conf->mirrors[d].rdev;
 			if (rdev &&
-			    test_bit(In_sync, &rdev->flags))
+			    !test_bit(Faulty, &rdev->flags))
 				r1_sync_page_io(rdev, sect, s,
 						conf->tmppage, WRITE);
 		}
@@ -2165,7 +2165,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
 			d--;
 			rdev = conf->mirrors[d].rdev;
 			if (rdev &&
-			    test_bit(In_sync, &rdev->flags)) {
+			    !test_bit(Faulty, &rdev->flags)) {
 				if (r1_sync_page_io(rdev, sect, s,
 						    conf->tmppage, READ)) {
 					atomic_add(s, &rdev->corrected_errors);



* Re: raid1_end_read_request does not retry failed READ from a recovering drive
       [not found]         ` <20140922101704.53be2056@notabene.brown>
@ 2015-07-22 16:10           ` Alexander Lyakas
  2015-07-23 23:24             ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lyakas @ 2015-07-22 16:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
Continuing our discussion, I see that you have added a commit [1],
whose diff is [2].
But this is only the second part of the fix. We also need the first
part, I believe, where in raid1_end_read_request we need to replace
"!test_bit(Faulty)" with "test_bit(In_sync)". Otherwise, we will never
retry the READ, and thus will never reach the fix_read_error code.
And we also need to "put all of raid1_spare_active inside the
spinlock", like you advised.
Do you agree?

We tested both parts of the fix, not the second part alone. Quoting myself:
"With this addition, the problem appears to be fixed", i.e., I meant
we applied both parts.

I am sorry for catching this so late. I never looked at what you
applied until now, because we are moving to kernel 3.18 long term, and
I am checking the raid1 changes. I just assumed you applied both
parts.

If you agree, it would be good if you tagged the first part of the fix
as "cc stable" too.

Thanks,
Alex.



[1]
commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd
Author: NeilBrown <neilb@suse.de>
Date:   Thu Sep 18 11:09:04 2014 +1000

    md/raid1: fix_read_error should act on all non-faulty devices.

    If a devices is being recovered it is not InSync and is not Faulty.

    If a read error is experienced on that device, fix_read_error()
    will be called, but it ignores non-InSync devices.  So it will
    neither fix the error nor fail the device.

    It is incorrect that fix_read_error() ignores non-InSync devices.
    It should only ignore Faulty devices.  So fix it.

    This became a bug when we allowed reading from a device that was being
    recovered.  It is suitable for any subsequent -stable kernel.

    Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
    Cc: stable@vger.kernel.org (v3.5+)
    Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
    Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

[2]
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 35649dd..55de4f6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2155,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
                        d--;
                        rdev = conf->mirrors[d].rdev;
                        if (rdev &&
-                           test_bit(In_sync, &rdev->flags))
+                           !test_bit(Faulty, &rdev->flags))
                                r1_sync_page_io(rdev, sect, s,
                                                conf->tmppage, WRITE);
                }
@@ -2167,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
                        d--;
                        rdev = conf->mirrors[d].rdev;
                        if (rdev &&
-                           test_bit(In_sync, &rdev->flags)) {
+                           !test_bit(Faulty, &rdev->flags)) {
                                if (r1_sync_page_io(rdev, sect, s,
                                                    conf->tmppage, READ)) {
                                        atomic_add(s, &rdev->corrected_errors);

On Mon, Sep 22, 2014 at 2:17 AM, NeilBrown <neilb@suse.de> wrote:
> On Sun, 21 Sep 2014 19:47:14 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Thanks, Neil,
>>
>> On Thu, Sep 18, 2014 at 4:05 AM, NeilBrown <neilb@suse.de> wrote:
>> > On Wed, 17 Sep 2014 20:57:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hi Neil,
>> >>
>> >> On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
>> >> > On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Neil,
>> >> >> we see the following issue:
>> >> >>
>> >> >> # RAID1 has 2 drives A and B, drive B is recovering
>> >> >> # READ request arrives
>> >> >> # read_balanace selects drive B to read from, because READ sector
>> >> >> comes before B->recovery_offset
>> >> >> # READ is issued to drive B, but fails (drive B fails again)
>> >> >>
>> >> >> Now raid1_end_read_request() has the following code:
>> >> >>
>> >> >>     if (uptodate)
>> >> >>         set_bit(R1BIO_Uptodate, &r1_bio->state);
>> >> >>     else {
>> >> >>         /* If all other devices have failed, we want to return
>> >> >>          * the error upwards rather than fail the last device.
>> >> >>          * Here we redefine "uptodate" to mean "Don't want to retry"
>> >> >>          */
>> >> >>         unsigned long flags;
>> >> >>         spin_lock_irqsave(&conf->device_lock, flags);
>> >> >>         if (r1_bio->mddev->degraded == conf->raid_disks ||
>> >> >>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
>> >> >>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
>> >> >>             uptodate = 1;
>> >> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >> >>     }
>> >> >>
>> >> >> According to this code uptodate wrongly becomes 1, because:
>> >> >> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
>> >> >> and
>> >> >> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
>> >> >>
>> >> >> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
>> >> >> However, this function treats !Faulty being equal to In_Sync, so it
>> >> >> decides that the last good drive failed, so it does not retry the
>> >> >> READ.
>> >> >>
>> >> >> As a result, there is IO error, while we should have retried the READ
>> >> >> from the healthy drive.
>> >> >>
>> >> >> This is happening in 3.8.13, but your master branch seems to have the
>> >> >> same issue.
>> >> >>
>> >> >> What is a reasonable fix?
>> >> >> 1) Do not read from drives which are !In_sync (a bit scary to read
>> >> >> from such drive)
>> >> >
>> >> > It is perfectly safe to read from a !In_sync device providing you are before
>> >> > ->recovery_offset.
>> >> >
>> >> >
>> >> >> 2) replace !Faulty to In_sync check
>> >> >
>> >> > That probably makes sense... though that could race with raid1_spare_active().
>> >> > If a read-error returned just after raid1_spare_active() set In_sync, and
>> >> > before 'count' was subtracted from ->degraded, we would still set uptodate
>> >> > when we shouldn't.
>> >> > It probably make sense to put all of raid1_spare_active inside the spinlock -
>> >> > it doesn't get call often enough that performance is an issue (I hope).
>> >> >
>> >> > So:
>> >> >  1/ change !Faulty to In_sync
>> >> >  2/ extend the spinlock in raid1_spare_active to cover the whole function.
>> >>
>> >>
>> >> I made these fixes and reproduced the issue. However, the result is
>> >> not what we expect:
>> >>
>> >> # raid1_end_read_request() now indeed adds the r1_bio into retry_list,
>> >> as we wanted
>> >> # raid1d calls fix_read_error()
>> >> # fix_read_error() searches for an In_sync drive to read the data
>> >> from. It finds such drive (this is our good drive A)
>> >> # now fix_read_error() wants to rewrite the bad area. But it rewrites
>> >> only on those drives that are In_sync (except the drive it got the
>> >> data from). In our case, it never tries to rewrite the data on drive B
>> >> (drive B is not marked Faulty and not marked In_sync). As a result,
>> >> md_error() is not called, so drive B is still not marked as Failed
>> >> when fix_read_error() completes
>> >> # so handle_read_error() retries the original READ by calling
>> >> read_balance(), which again in my case selects the recovering drive
>> >> B...
>> >>
>> >> And then the whole flow repeats itself again and again...and READ
>> >> never completes.
>> >>
>> >> Maybe we should not allow selecting recovering drives for READ? Or
>> >> some other approach?
>> >>
>> >
>> > Thanks for the testing and analysis.
>> > Presumably we just want handle_read_error() to write to all non-faulty
>> > devices, not just the InSync ones.
>> > i.e. the following patch.
>> >
>> > Thanks,
>> > NeilBrown
>> >
>> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> > index 6a9c73435eb8..a95f9e179e6f 100644
>> > --- a/drivers/md/raid1.c
>> > +++ b/drivers/md/raid1.c
>> > @@ -2153,7 +2153,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>> >                         d--;
>> >                         rdev = conf->mirrors[d].rdev;
>> >                         if (rdev &&
>> > -                           test_bit(In_sync, &rdev->flags))
>> > +                           !test_bit(Faulty, &rdev->flags))
>> >                                 r1_sync_page_io(rdev, sect, s,
>> >                                                 conf->tmppage, WRITE);
>> >                 }
>> > @@ -2165,7 +2165,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>> >                         d--;
>> >                         rdev = conf->mirrors[d].rdev;
>> >                         if (rdev &&
>> > -                           test_bit(In_sync, &rdev->flags)) {
>> > +                           !test_bit(Faulty, &rdev->flags)) {
>> >                                 if (r1_sync_page_io(rdev, sect, s,
>> >                                                     conf->tmppage, READ)) {
>> >                                         atomic_add(s, &rdev->corrected_errors);
>>
>> With this addition, the problem appears to be fixed. We will give it
>> some regression testing & will let you know if we see any issues.
>>
>> I presume you will be applying this fix upstream as well.
>
> Yes, it is already in my for-next branch.
> I've just added your tested-by.
>
> Thanks,
> NeilBrown


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2015-07-22 16:10           ` Alexander Lyakas
@ 2015-07-23 23:24             ` NeilBrown
  2015-07-26  8:15               ` Alexander Lyakas
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-07-23 23:24 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid

On Wed, 22 Jul 2015 18:10:31 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> In continuation of our discussion, I see that you have added a
> commit[1], which has a diff[2].
> But this is only the second part of the fix. We also need the first
> part, I believe, where in raid1_end_read_request we need to replace
> "!test_bit(Faulty)" with "test_bit(In_sync)". Otherwise, we will never
> retry the READ, and thus will never reach the fix_read_error code.
> And we also need to "put all of raid1_spare_active inside the
> spinlock", like you advised.
> Do you agree?
> 
> We tested both parts of the fix, not the second part alone. Quoting myself:
> "With this addition, the problem appears to be fixed", i.e., I meant
> we applied both parts.
> 
> I am sorry for catching this so late. I never looked at what you
> applied until now, because we are moving to kernel 3.18 long term, and
> I am checking the raid1 changes. I just assumed you applied both
> parts.
> 
> If you agree, it would be good if you tag the first part of the fix as
> "cc stable" too.
> 
> Thanks,
> Alex.
> 
> 
> 
> [1]
> commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd
> Author: NeilBrown <neilb@suse.de>
> Date:   Thu Sep 18 11:09:04 2014 +1000
> 
>     md/raid1: fix_read_error should act on all non-faulty devices.
> 
>     If a devices is being recovered it is not InSync and is not Faulty.
> 
>     If a read error is experienced on that device, fix_read_error()
>     will be called, but it ignores non-InSync devices.  So it will
>     neither fix the error nor fail the device.
> 
>     It is incorrect that fix_read_error() ignores non-InSync devices.
>     It should only ignore Faulty devices.  So fix it.
> 
>     This became a bug when we allowed reading from a device that was being
>     recovered.  It is suitable for any subsequent -stable kernel.
> 
>     Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
>     Cc: stable@vger.kernel.org (v3.5+)
>     Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
>     Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
>     Signed-off-by: NeilBrown <neilb@suse.de>
> 
> [2]
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 35649dd..55de4f6 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2155,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf,
> int read_disk,
>                         d--;
>                         rdev = conf->mirrors[d].rdev;
>                         if (rdev &&
> -                           test_bit(In_sync, &rdev->flags))
> +                           !test_bit(Faulty, &rdev->flags))
>                                 r1_sync_page_io(rdev, sect, s,
>                                                 conf->tmppage, WRITE);
>                 }
> @@ -2167,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf,
> int read_disk,
>                         d--;
>                         rdev = conf->mirrors[d].rdev;
>                         if (rdev &&
> -                           test_bit(In_sync, &rdev->flags)) {
> +                           !test_bit(Faulty, &rdev->flags)) {
>                                 if (r1_sync_page_io(rdev, sect, s,
>                                                     conf->tmppage, READ)) {
>                                         atomic_add(s, &rdev->corrected_errors);
> 
> On Mon, Sep 22, 2014 at 2:17 AM, NeilBrown <neilb@suse.de> wrote:
> > On Sun, 21 Sep 2014 19:47:14 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Thanks, Neil,
> >>
> >> On Thu, Sep 18, 2014 at 4:05 AM, NeilBrown <neilb@suse.de> wrote:
> >> > On Wed, 17 Sep 2014 20:57:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >>
> >> >> On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
> >> >> > On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi Neil,
> >> >> >> we see the following issue:
> >> >> >>
> >> >> >> # RAID1 has 2 drives A and B, drive B is recovering
> >> >> >> # READ request arrives
> >> >> >> # read_balanace selects drive B to read from, because READ sector
> >> >> >> comes before B->recovery_offset
> >> >> >> # READ is issued to drive B, but fails (drive B fails again)
> >> >> >>
> >> >> >> Now raid1_end_read_request() has the following code:
> >> >> >>
> >> >> >>     if (uptodate)
> >> >> >>         set_bit(R1BIO_Uptodate, &r1_bio->state);
> >> >> >>     else {
> >> >> >>         /* If all other devices have failed, we want to return
> >> >> >>          * the error upwards rather than fail the last device.
> >> >> >>          * Here we redefine "uptodate" to mean "Don't want to retry"
> >> >> >>          */
> >> >> >>         unsigned long flags;
> >> >> >>         spin_lock_irqsave(&conf->device_lock, flags);
> >> >> >>         if (r1_bio->mddev->degraded == conf->raid_disks ||
> >> >> >>             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
> >> >> >>              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
> >> >> >>             uptodate = 1;
> >> >> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
> >> >> >>     }
> >> >> >>
> >> >> >> According to this code uptodate wrongly becomes 1, because:
> >> >> >> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
> >> >> >> and
> >> >> >> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
> >> >> >>
> >> >> >> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
> >> >> >> However, this function treats !Faulty being equal to In_Sync, so it
> >> >> >> decides that the last good drive failed, so it does not retry the
> >> >> >> READ.
> >> >> >>
> >> >> >> As a result, there is IO error, while we should have retried the READ
> >> >> >> from the healthy drive.
> >> >> >>
> >> >> >> This is happening in 3.8.13, but your master branch seems to have the
> >> >> >> same issue.
> >> >> >>
> >> >> >> What is a reasonable fix?
> >> >> >> 1) Do not read from drives which are !In_sync (a bit scary to read
> >> >> >> from such drive)
> >> >> >
> >> >> > It is perfectly safe to read from a !In_sync device providing you are before
> >> >> > ->recovery_offset.
> >> >> >
> >> >> >
> >> >> >> 2) replace !Faulty to In_sync check
> >> >> >
> >> >> > That probably makes sense... though that could race with raid1_spare_active().
> >> >> > If a read-error returned just after raid1_spare_active() set In_sync, and
> >> >> > before 'count' was subtracted from ->degraded, we would still set uptodate
> >> >> > when we shouldn't.
> >> >> > It probably make sense to put all of raid1_spare_active inside the spinlock -
> >> >> > it doesn't get call often enough that performance is an issue (I hope).
> >> >> >
> >> >> > So:
> >> >> >  1/ change !Faulty to In_sync
> >> >> >  2/ extend the spinlock in raid1_spare_active to cover the whole function.
> >> >>
> >> >>
> >> >> I made these fixes and reproduced the issue. However, the result is
> >> >> not what we expect:
> >> >>
> >> >> # raid1_end_read_request() now indeed adds the r1_bio into retry_list,
> >> >> as we wanted
> >> >> # raid1d calls fix_read_error()
> >> >> # fix_read_error() searches for an In_sync drive to read the data
> >> >> from. It finds such drive (this is our good drive A)
> >> >> # now fix_read_error() wants to rewrite the bad area. But it rewrites
> >> >> only on those drives that are In_sync (except the drive it got the
> >> >> data from). In our case, it never tries to rewrite the data on drive B
> >> >> (drive B is not marked Faulty and not marked In_sync). As a result,
> >> >> md_error() is not called, so drive B is still not marked as Failed
> >> >> when fix_read_error() completes
> >> >> # so handle_read_error() retries the original READ by calling
> >> >> read_balance(), which again in my case selects the recovering drive
> >> >> B...
> >> >>
> >> >> And then the whole flow repeats itself again and again...and READ
> >> >> never completes.
> >> >>
> >> >> Maybe we should not allow selecting recovering drives for READ? Or
> >> >> some other approach?
> >> >>
> >> >
> >> > Thanks for the testing and analysis.
> >> > Presumably we just want handle_read_error() to write to all non-faulty
> >> > devices, not just the InSync ones.
> >> > i.e. the following patch.
> >> >
> >> > Thanks,
> >> > NeilBrown
> >> >
> >> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> > index 6a9c73435eb8..a95f9e179e6f 100644
> >> > --- a/drivers/md/raid1.c
> >> > +++ b/drivers/md/raid1.c
> >> > @@ -2153,7 +2153,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
> >> >                         d--;
> >> >                         rdev = conf->mirrors[d].rdev;
> >> >                         if (rdev &&
> >> > -                           test_bit(In_sync, &rdev->flags))
> >> > +                           !test_bit(Faulty, &rdev->flags))
> >> >                                 r1_sync_page_io(rdev, sect, s,
> >> >                                                 conf->tmppage, WRITE);
> >> >                 }
> >> > @@ -2165,7 +2165,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
> >> >                         d--;
> >> >                         rdev = conf->mirrors[d].rdev;
> >> >                         if (rdev &&
> >> > -                           test_bit(In_sync, &rdev->flags)) {
> >> > +                           !test_bit(Faulty, &rdev->flags)) {
> >> >                                 if (r1_sync_page_io(rdev, sect, s,
> >> >                                                     conf->tmppage, READ)) {
> >> >                                         atomic_add(s, &rdev->corrected_errors);
> >>
> >> With this addition, the problem appears to be fixed. We will give it
> >> some regression testing & will let you know if we see any issues.
> >>
> >> I presume you will be applying this fix upstream as well.
> >
> > Yes, it is already in my for-next branch.
> > I've just added your tested-by.
> >
> > Thanks,
> > NeilBrown

Hi Alex
thanks for noticing!
Just to be sure we mean the same thing: this is the patch which is
missing - correct?

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Fri, 24 Jul 2015 09:22:16 +1000
Subject: [PATCH] md/raid1: fix test for 'was read error from last working
 device'.

When we get a read error from the last working device, we don't
try to repair it, and don't fail the device.  We simple report a
read error to the caller.

However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device.  However a spare which is rebuilding
would be non-faulty but so not the only working device.

So change the test from "!Faulty" to "In_sync".  If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.

This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 166616411215..b368307a9651 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -336,7 +336,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
-		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+		     test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
 			uptodate = 1;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2015-07-23 23:24             ` NeilBrown
@ 2015-07-26  8:15               ` Alexander Lyakas
  2015-07-27  1:56                 ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lyakas @ 2015-07-26  8:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

On Fri, Jul 24, 2015 at 1:24 AM, NeilBrown <neilb@suse.com> wrote:
> Hi Alex
> thanks for noticing!
> Just to be sure we mean the same thing: this is the patch which is
> missing - correct?
>
> Thanks,
> NeilBrown
>
> From: NeilBrown <neilb@suse.com>
> Date: Fri, 24 Jul 2015 09:22:16 +1000
> Subject: [PATCH] md/raid1: fix test for 'was read error from last working
>  device'.
>
> When we get a read error from the last working device, we don't
> try to repair it, and don't fail the device.  We simple report a
> read error to the caller.
>
> However the current test for 'is this the last working device' is
> wrong.
> When there is only one fully working device, it assumes that a
> non-faulty device is that device.  However a spare which is rebuilding
> would be non-faulty but so not the only working device.
>
> So change the test from "!Faulty" to "In_sync".  If ->degraded says
> there is only one fully working device and this device is in_sync,
> this must be the one.
>
> This bug has existed since we allowed read_balance to read from
> a recovering spare in v3.0
>
> Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
> Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
> Cc: stable@vger.kernel.org (v3.0+)
> Signed-off-by: NeilBrown <neilb@suse.com>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 166616411215..b368307a9651 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -336,7 +336,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
>                 spin_lock_irqsave(&conf->device_lock, flags);
>                 if (r1_bio->mddev->degraded == conf->raid_disks ||
>                     (r1_bio->mddev->degraded == conf->raid_disks-1 &&
> -                    !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
> +                    test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
>                         uptodate = 1;
>                 spin_unlock_irqrestore(&conf->device_lock, flags);
>         }

Yes, this was the missing piece.

But you also advised to put the whole of "raid1_spare_active" under
the spinlock:
> 2/ extend the spinlock in raid1_spare_active to cover the whole function.
So we did this bit too. Can you please comment on whether this is needed?

Also, will you be sending this patch to "stable"? We are moving to
kernel 3.18, so we should get this then.

Thanks!
Alex.


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2015-07-26  8:15               ` Alexander Lyakas
@ 2015-07-27  1:56                 ` NeilBrown
  2015-07-27  8:07                   ` Alexander Lyakas
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2015-07-27  1:56 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid

On Sun, 26 Jul 2015 10:15:05 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> 
> On Fri, Jul 24, 2015 at 1:24 AM, NeilBrown <neilb@suse.com> wrote:
> > Hi Alex
> > thanks for noticing!
> > Just to be sure we mean the same thing: this is the patch which is
> > missing - correct?
> >
> > Thanks,
> > NeilBrown
> >
> > From: NeilBrown <neilb@suse.com>
> > Date: Fri, 24 Jul 2015 09:22:16 +1000
> > Subject: [PATCH] md/raid1: fix test for 'was read error from last working
> >  device'.
> >
> > When we get a read error from the last working device, we don't
> > try to repair it, and don't fail the device.  We simple report a
> > read error to the caller.
> >
> > However the current test for 'is this the last working device' is
> > wrong.
> > When there is only one fully working device, it assumes that a
> > non-faulty device is that device.  However a spare which is rebuilding
> > would be non-faulty but so not the only working device.
> >
> > So change the test from "!Faulty" to "In_sync".  If ->degraded says
> > there is only one fully working device and this device is in_sync,
> > this must be the one.
> >
> > This bug has existed since we allowed read_balance to read from
> > a recovering spare in v3.0
> >
> > Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
> > Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
> > Cc: stable@vger.kernel.org (v3.0+)
> > Signed-off-by: NeilBrown <neilb@suse.com>
> >
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 166616411215..b368307a9651 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -336,7 +336,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
> >                 spin_lock_irqsave(&conf->device_lock, flags);
> >                 if (r1_bio->mddev->degraded == conf->raid_disks ||
> >                     (r1_bio->mddev->degraded == conf->raid_disks-1 &&
> > -                    !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
> > +                    test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
> >                         uptodate = 1;
> >                 spin_unlock_irqrestore(&conf->device_lock, flags);
> >         }
> 
> Yes, this was the missing piece.
> 
> But you also advised to put the whole of "raid1_spare_active" under
> the spinlock:
> > 2/ extend the spinlock in raid1_spare_active to cover the whole function.
> So we did this bit too. Can you pls comment if this is needed?

When you say "we did this bit" it would help a lot to actually show the
patch - helps avoid misunderstanding.  In fact I was probably hoping
that you would post a patch once you had it fixed.... no mater.
See below for that patch I have just queue.  There is an extra place
were a spinlock is probably helpful.

> 
> Also, will you be sending this patch to "stable"? We are moving to
> kernel 3.18, so we should get this then.

Putting "cc: stable@vger.kernel.org" means that it will automatically
migrated to the stable kernels once it has appeared in Linus' kernel.
So yes: it will go to -stable

> 
> Thanks!
> Alex.

Thanks,
NeilBrown

From 04f58ef505b9d354dd06c477a94a7e314a38cb72 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Mon, 27 Jul 2015 11:48:52 +1000
Subject: [PATCH] md/raid1: extend spinlock to protect raid1_end_read_request
 against inconsistencies

raid1_end_read_request() assumes that the In_sync bits are consistent
with the ->degaded count.
raid1_spare_active updates the In_sync bit before the ->degraded count
and so exposes an inconsistency, as does error()
So extend the spinlock in raid1_spare_active() and error() to hide those
inconsistencies.

This should probably be part of
  Commit: 34cab6f42003 ("md/raid1: fix test for 'was read error from
  last working device'.")
as it addresses the same issue.  It fixes the same bug and should go
to -stable for same reasons.

Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b368307a9651..742b50794dfd 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1476,6 +1476,7 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	char b[BDEVNAME_SIZE];
 	struct r1conf *conf = mddev->private;
+	unsigned long flags;
 
 	/*
 	 * If it is not operational, then we have already marked it as dead
@@ -1495,14 +1496,13 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
 		return;
 	}
 	set_bit(Blocked, &rdev->flags);
+	spin_lock_irqsave(&conf->device_lock, flags);
 	if (test_and_clear_bit(In_sync, &rdev->flags)) {
-		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
 		mddev->degraded++;
 		set_bit(Faulty, &rdev->flags);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else
 		set_bit(Faulty, &rdev->flags);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
 	 */
@@ -1568,7 +1568,10 @@ static int raid1_spare_active(struct mddev *mddev)
 	 * Find all failed disks within the RAID1 configuration
 	 * and mark them readable.
 	 * Called under mddev lock, so rcu protection not needed.
+	 * device_lock used to avoid races with raid1_end_read_request
+	 * which expects 'In_sync' flags and ->degraded to be consistent.
 	 */
+	spin_lock_irqsave(&conf->device_lock, flags);
 	for (i = 0; i < conf->raid_disks; i++) {
 		struct md_rdev *rdev = conf->mirrors[i].rdev;
 		struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
@@ -1599,7 +1602,6 @@ static int raid1_spare_active(struct mddev *mddev)
 			sysfs_notify_dirent_safe(rdev->sysfs_state);
 		}
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
 	mddev->degraded -= count;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2015-07-27  1:56                 ` NeilBrown
@ 2015-07-27  8:07                   ` Alexander Lyakas
  2015-07-31  0:12                     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lyakas @ 2015-07-27  8:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,
Thanks for the comments. Hopefully now we have the complete fix.
I am posting what we have applied on top of 3.8.13 (we are now moving
to 3.18.19, which already has part of the fix, so I will need to apply
a delta until both of your latest patches reach Mr. Stable). Taking the
spinlock in the "error" function is not included here (but I will apply
it to 3.18.19).

The patch below is a bit ugly:
- CONFIG_MD_ZADARA is a define that we add so that every engineer can
clearly distinguish our changes from the original code. I know it's
ugly.
- zklog is a macro that eventually ends up in printk. It only adds
some extra prints and is not functionally needed. zklog also prints
current->pid, the function name, etc., to give more context.

Hopefully, gmail will not make the patch even uglier.

Thanks for your help!
Alex.


diff --git a/md/3.8.13-030813-generic/drivers/md/raid1.c b/md/3.8.13-030813-generic/drivers/md/raid1.c
index c864a6e..32170e9 100644
--- a/md/3.8.13-030813-generic/drivers/md/raid1.c
+++ b/md/3.8.13-030813-generic/drivers/md/raid1.c
@@ -322,24 +322,36 @@ static void raid1_end_read_request(struct bio *bio, int error)

     if (uptodate)
         set_bit(R1BIO_Uptodate, &r1_bio->state);
     else {
         /* If all other devices have failed, we want to return
          * the error upwards rather than fail the last device.
          * Here we redefine "uptodate" to mean "Don't want to retry"
          */
         unsigned long flags;
         spin_lock_irqsave(&conf->device_lock, flags);
+#ifndef CONFIG_MD_ZADARA
         if (r1_bio->mddev->degraded == conf->raid_disks ||
             (r1_bio->mddev->degraded == conf->raid_disks-1 &&
              !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
             uptodate = 1;
+#else/*CONFIG_MD_ZADARA*/
+        /* Make sure we retry read error from a recovering device */
+        if (r1_bio->mddev->degraded == conf->raid_disks ||
+            (r1_bio->mddev->degraded == conf->raid_disks-1 &&
+             test_bit(In_sync, &conf->mirrors[mirror].rdev->flags))) {
+            char b[BDEVNAME_SIZE];
+            zklog_rl(mdname(conf->mddev), Z_KERR, "READ rdev=%s[%llu:%u] ERROR - not retrying (last good drive)",
+                     bdevname(conf->mirrors[mirror].rdev->bdev, b), (unsigned long long)r1_bio->sector, r1_bio->sectors);
+            uptodate = 1;
+        }
+#endif/*CONFIG_MD_ZADARA*/
         spin_unlock_irqrestore(&conf->device_lock, flags);
     }

     if (uptodate) {
         raid_end_bio_io(r1_bio);
         rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
     } else {
         /*
          * oops, read error:
          */
@@ -1418,20 +1430,31 @@ static void status(struct seq_file *seq, struct mddev *mddev)
     rcu_read_unlock();
     seq_printf(seq, "]");
 }


 static void error(struct mddev *mddev, struct md_rdev *rdev)
 {
     char b[BDEVNAME_SIZE];
     struct r1conf *conf = mddev->private;

+#ifdef CONFIG_MD_ZADARA
+    {
+        static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+        if (__ratelimit(&_rs)) {
+            zklog(mdname(mddev), Z_KERR, "rdev=%s failure InSync=%u Failed=%u STACK:",
+                  bdevname(rdev->bdev, b), test_bit(In_sync, &rdev->flags), test_bit(Faulty, &rdev->flags));
+            dump_stack();
+        }
+    }
+#endif/*CONFIG_MD_ZADARA*/
+
     /*
      * If it is not operational, then we have already marked it as dead
      * else if it is the last working disks, ignore the error, let the
      * next level up know.
      * else mark the drive as failed
      */
     if (test_bit(In_sync, &rdev->flags)
         && (conf->raid_disks - mddev->degraded) == 1) {
         /*
          * Don't fail the drive, act as though we were just a
@@ -1497,20 +1520,25 @@ static void close_sync(struct r1conf *conf)
     conf->r1buf_pool = NULL;
 }

 static int raid1_spare_active(struct mddev *mddev)
 {
     int i;
     struct r1conf *conf = mddev->private;
     int count = 0;
     unsigned long flags;

+#ifdef CONFIG_MD_ZADARA
+    /* we lock the whole function */
+    spin_lock_irqsave(&conf->device_lock, flags);
+#endif /*CONFIG_MD_ZADARA*/
+
     /*
      * Find all failed disks within the RAID1 configuration
      * and mark them readable.
      * Called under mddev lock, so rcu protection not needed.
      */
     for (i = 0; i < conf->raid_disks; i++) {
         struct md_rdev *rdev = conf->mirrors[i].rdev;
         struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
         if (repl
             && repl->recovery_offset == MaxSector
@@ -1530,21 +1558,24 @@ static int raid1_spare_active(struct mddev *mddev)
                     rdev->sysfs_state);
             }
         }
         if (rdev
             && !test_bit(Faulty, &rdev->flags)
             && !test_and_set_bit(In_sync, &rdev->flags)) {
             count++;
             sysfs_notify_dirent_safe(rdev->sysfs_state);
         }
     }
+#ifndef CONFIG_MD_ZADARA
+    /* we lock the whole function */
     spin_lock_irqsave(&conf->device_lock, flags);
+#endif /*CONFIG_MD_ZADARA*/
     mddev->degraded -= count;
     spin_unlock_irqrestore(&conf->device_lock, flags);

     print_conf(conf);
     return count;
 }


 static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 {
@@ -2027,20 +2058,27 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
  *
  *    1.    Retries failed read operations on working mirrors.
  *    2.    Updates the raid superblock when problems encounter.
  *    3.    Performs writes following reads for array synchronising.
  */

 static void fix_read_error(struct r1conf *conf, int read_disk,
                sector_t sect, int sectors)
 {
     struct mddev *mddev = conf->mddev;
+#ifdef CONFIG_MD_ZADARA
+    {
+        char b[BDEVNAME_SIZE] = {'\0'};
+        zklog_rl(mdname(mddev), Z_KWARN, "attempting to fix READ rdev=%s[%llu:%u]",
+                 bdevname(conf->mirrors[read_disk].rdev->bdev, b), (unsigned long long)sect, sectors);
+    }
+#endif /*CONFIG_MD_ZADARA*/
     while(sectors) {
         int s = sectors;
         int d = read_disk;
         int success = 0;
         int start;
         struct md_rdev *rdev;

         if (s > (PAGE_SIZE>>9))
             s = PAGE_SIZE >> 9;

@@ -2078,33 +2116,41 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
             break;
         }
         /* write it back and re-read */
         start = d;
         while (d != read_disk) {
             if (d==0)
                 d = conf->raid_disks * 2;
             d--;
             rdev = conf->mirrors[d].rdev;
             if (rdev &&
+#ifndef CONFIG_MD_ZADARA
                 test_bit(In_sync, &rdev->flags))
+#else /*CONFIG_MD_ZADARA*/
+                !test_bit(Faulty, &rdev->flags))
+#endif /*CONFIG_MD_ZADARA*/
                 r1_sync_page_io(rdev, sect, s,
                         conf->tmppage, WRITE);
         }
         d = start;
         while (d != read_disk) {
             char b[BDEVNAME_SIZE];
             if (d==0)
                 d = conf->raid_disks * 2;
             d--;
             rdev = conf->mirrors[d].rdev;
             if (rdev &&
+#ifndef CONFIG_MD_ZADARA
                 test_bit(In_sync, &rdev->flags)) {
+#else /*CONFIG_MD_ZADARA*/
+                !test_bit(Faulty, &rdev->flags)) {
+#endif /*CONFIG_MD_ZADARA*/
                 if (r1_sync_page_io(rdev, sect, s,
                             conf->tmppage, READ)) {
                     atomic_add(s, &rdev->corrected_errors);
                     printk(KERN_INFO
                            "md/raid1:%s: read error corrected "
                            "(%d sectors at %llu on %s)\n",
                            mdname(mddev), s,
                            (unsigned long long)(sect +
                                rdev->data_offset),
                            bdevname(rdev->bdev, b));



On Mon, Jul 27, 2015 at 3:56 AM, NeilBrown <neilb@suse.com> wrote:
> On Sun, 26 Jul 2015 10:15:05 +0200 Alexander Lyakas
> <alex.bolshoy@gmail.com> wrote:
>
>> Hi Neil,
>>
>> On Fri, Jul 24, 2015 at 1:24 AM, NeilBrown <neilb@suse.com> wrote:
>> > Hi Alex
>> > thanks for noticing!
>> > Just to be sure we mean the same thing: this is the patch which is
>> > missing - correct?
>> >
>> > Thanks,
>> > NeilBrown
>> >
>> > From: NeilBrown <neilb@suse.com>
>> > Date: Fri, 24 Jul 2015 09:22:16 +1000
>> > Subject: [PATCH] md/raid1: fix test for 'was read error from last working
>> >  device'.
>> >
>> > When we get a read error from the last working device, we don't
>> > try to repair it, and don't fail the device.  We simple report a
>> > read error to the caller.
>> >
>> > However the current test for 'is this the last working device' is
>> > wrong.
>> > When there is only one fully working device, it assumes that a
>> > non-faulty device is that device.  However a spare which is rebuilding
>> > would be non-faulty but so not the only working device.
>> >
>> > So change the test from "!Faulty" to "In_sync".  If ->degraded says
>> > there is only one fully working device and this device is in_sync,
>> > this must be the one.
>> >
>> > This bug has existed since we allowed read_balance to read from
>> > a recovering spare in v3.0
>> >
>> > Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
>> > Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
>> > Cc: stable@vger.kernel.org (v3.0+)
>> > Signed-off-by: NeilBrown <neilb@suse.com>
>> >
>> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> > index 166616411215..b368307a9651 100644
>> > --- a/drivers/md/raid1.c
>> > +++ b/drivers/md/raid1.c
>> > @@ -336,7 +336,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
>> >                 spin_lock_irqsave(&conf->device_lock, flags);
>> >                 if (r1_bio->mddev->degraded == conf->raid_disks ||
>> >                     (r1_bio->mddev->degraded == conf->raid_disks-1 &&
>> > -                    !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
>> > +                    test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
>> >                         uptodate = 1;
>> >                 spin_unlock_irqrestore(&conf->device_lock, flags);
>> >         }
>>
>> Yes, this was the missing piece.
>>
>> But you also advised to put the whole of "raid1_spare_active" under
>> the spinlock:
>> > 2/ extend the spinlock in raid1_spare_active to cover the whole function.
>> So we did this bit too. Can you pls comment if this is needed?
>
> When you say "we did this bit" it would help a lot to actually show the
> patch - helps avoid misunderstanding.  In fact I was probably hoping
> that you would post a patch once you had it fixed.... no mater.
> See below for that patch I have just queue.  There is an extra place
> were a spinlock is probably helpful.
>
>>
>> Also, will you be sending this patch to "stable"? We are moving to
>> kernel 3.18, so we should get this then.
>
> Putting "cc: stable@vger.kernel.org" means that it will automatically
> migrated to the stable kernels once it has appeared in Linus' kernel.
> So yes: it will go to -stable
>
>>
>> Thanks!
>> Alex.
>
> Thanks,
> NeilBrown
>
> From 04f58ef505b9d354dd06c477a94a7e314a38cb72 Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.com>
> Date: Mon, 27 Jul 2015 11:48:52 +1000
> Subject: [PATCH] md/raid1: extend spinlock to protect raid1_end_read_request
>  against inconsistencies
>
> raid1_end_read_request() assumes that the In_sync bits are consistent
> with the ->degaded count.
> raid1_spare_active updates the In_sync bit before the ->degraded count
> and so exposes an inconsistency, as does error()
> So extend the spinlock in raid1_spare_active() and error() to hide those
> inconsistencies.
>
> This should probably be part of
>   Commit: 34cab6f42003 ("md/raid1: fix test for 'was read error from
>   last working device'.")
> as it addresses the same issue.  It fixes the same bug and should go
> to -stable for same reasons.
>
> Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
> Cc: stable@vger.kernel.org (v3.0+)
> Signed-off-by: NeilBrown <neilb@suse.com>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index b368307a9651..742b50794dfd 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1476,6 +1476,7 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
>  {
>         char b[BDEVNAME_SIZE];
>         struct r1conf *conf = mddev->private;
> +       unsigned long flags;
>
>         /*
>          * If it is not operational, then we have already marked it as dead
> @@ -1495,14 +1496,13 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
>                 return;
>         }
>         set_bit(Blocked, &rdev->flags);
> +       spin_lock_irqsave(&conf->device_lock, flags);
>         if (test_and_clear_bit(In_sync, &rdev->flags)) {
> -               unsigned long flags;
> -               spin_lock_irqsave(&conf->device_lock, flags);
>                 mddev->degraded++;
>                 set_bit(Faulty, &rdev->flags);
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
>         } else
>                 set_bit(Faulty, &rdev->flags);
> +       spin_unlock_irqrestore(&conf->device_lock, flags);
>         /*
>          * if recovery is running, make sure it aborts.
>          */
> @@ -1568,7 +1568,10 @@ static int raid1_spare_active(struct mddev *mddev)
>          * Find all failed disks within the RAID1 configuration
>          * and mark them readable.
>          * Called under mddev lock, so rcu protection not needed.
> +        * device_lock used to avoid races with raid1_end_read_request
> +        * which expects 'In_sync' flags and ->degraded to be consistent.
>          */
> +       spin_lock_irqsave(&conf->device_lock, flags);
>         for (i = 0; i < conf->raid_disks; i++) {
>                 struct md_rdev *rdev = conf->mirrors[i].rdev;
>                 struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
> @@ -1599,7 +1602,6 @@ static int raid1_spare_active(struct mddev *mddev)
>                         sysfs_notify_dirent_safe(rdev->sysfs_state);
>                 }
>         }
> -       spin_lock_irqsave(&conf->device_lock, flags);
>         mddev->degraded -= count;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>


* Re: raid1_end_read_request does not retry failed READ from a recovering drive
  2015-07-27  8:07                   ` Alexander Lyakas
@ 2015-07-31  0:12                     ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2015-07-31  0:12 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid

On Mon, 27 Jul 2015 10:07:37 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> Thanks for the comments. Hopefully now we have the complete fix.
> I am posting what we have applied on top of 3.8.13 (now we are moving
> to 3.18.19, which already has part of the fix, so I will need to apply
> a delta, until both your latest patches reach Mr. Stable). Locking the
> spinlock in "error" function is not there (but I will apply it to
> 3.18.19).
> 
> Below patch is a bit ugly:
> - CONFIG_MD_ZADARA is a define that we add, so that every engineer can
> clearly distinguish our changes vs the original code. I know it's
> ugly.
> - zklog is a macro that eventually ends up in printk. This is only to
> have a bit more prints, and is not functionally needed. zklog also
> prints current->pid, function name etc, to have more context.
> 
> Hopefully, gmail will not make the patch even uglier.
> 
> Thanks for your help!
> Alex.
> 

Thanks for that Alex - I think we are on the same page now :-)
Thanks,
NeilBrown


