* Last working drive in RAID1
@ 2015-03-04 19:55 Eric Mei
  2015-03-04 21:46 ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Mei @ 2015-03-04 19:55 UTC (permalink / raw)
  To: linux-raid

Hi,

It is interesting to note that RAID1 won't mark the last working drive
as Faulty, no matter what. The responsible code seems to be here:

static void error(struct mddev *mddev, struct md_rdev *rdev)
{
         ...
         /*
          * If it is not operational, then we have already marked it as dead
          * else if it is the last working disks, ignore the error, let the
          * next level up know.
          * else mark the drive as failed
          */
         if (test_bit(In_sync, &rdev->flags)
             && (conf->raid_disks - mddev->degraded) == 1) {
                 /*
                  * Don't fail the drive, act as though we were just a
                  * normal single drive.
                  * However don't try a recovery from this drive as
                  * it is very likely to fail.
                  */
                 conf->recovery_disabled = mddev->recovery_disabled;
                 return;
         }
         ...
}

The end result is that even if all the drives are physically gone, one
drive still remains in the array forever, and mdadm continues to report
the array as degraded instead of failed. RAID10 has similar behavior.
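
For example, the behaviour can be reproduced on a scratch array with
something like this (device names are only examples and this is
untested; the exact mdadm messages may differ):

   # build a 2-drive RAID1 from two scratch disks
   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
   # failing the first member works as expected
   mdadm /dev/md0 --fail /dev/sdb
   # failing the last member is refused (mdadm should report EBUSY); the
   # same code path also ignores I/O errors on it, so the array is never
   # marked as failed
   mdadm /dev/md0 --fail /dev/sdc
   # /proc/mdstat keeps reporting the array as active and degraded
   cat /proc/mdstat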

Is there any reason we absolutely don't want to fail the last drive of 
RAID1?

Thanks
Eric


* Re: Last working drive in RAID1
  2015-03-04 19:55 Last working drive in RAID1 Eric Mei
@ 2015-03-04 21:46 ` NeilBrown
  2015-03-04 22:48   ` Eric Mei
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2015-03-04 21:46 UTC (permalink / raw)
  To: Eric Mei; +Cc: linux-raid

On Wed, 04 Mar 2015 12:55:43 -0700 Eric Mei <meijia@gmail.com> wrote:

> Hi,
> 
> It is interesting to note that RAID1 won't mark the last working drive
> as Faulty, no matter what. The responsible code seems to be here:
> 
> static void error(struct mddev *mddev, struct md_rdev *rdev)
> {
>          ...
>          /*
>           * If it is not operational, then we have already marked it as dead
>           * else if it is the last working disks, ignore the error, let the
>           * next level up know.
>           * else mark the drive as failed
>           */
>          if (test_bit(In_sync, &rdev->flags)
>              && (conf->raid_disks - mddev->degraded) == 1) {
>                  /*
>                   * Don't fail the drive, act as though we were just a
>                   * normal single drive.
>                   * However don't try a recovery from this drive as
>                   * it is very likely to fail.
>                   */
>                  conf->recovery_disabled = mddev->recovery_disabled;
>                  return;
>          }
>          ...
> }
> 
> The end result is that even if all the drives are physically gone, one
> drive still remains in the array forever, and mdadm continues to report
> the array as degraded instead of failed. RAID10 has similar behavior.
> 
> Is there any reason we absolutely don't want to fail the last drive of 
> RAID1?
> 

When a RAID1 only has one drive remaining, then it should act as much as
possible like a single plain ordinary drive.

How does /dev/sda behave when you physically remove the device?  md0 (as a
raid1 with one drive) should do the same.

NeilBrown


* Re: Last working drive in RAID1
  2015-03-04 21:46 ` NeilBrown
@ 2015-03-04 22:48   ` Eric Mei
  2015-03-04 23:26     ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Mei @ 2015-03-04 22:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

I see, that does make sense. Thank you.

But it imposes a problem for HA. We have 2 nodes as an active-standby
pair; if the HW on node 1 has a problem (e.g. a SAS cable gets pulled,
so all access to the physical drives is gone), we hope the array fails
over to node 2. But with the lingering drive reference, mdadm will
report that the array is still alive, so failover won't happen.
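
For illustration, a naive failover health check on our side, based on
the md sysfs counters, would look something like the sketch below (md0
and the paths are only an example); because md never fails the last
In_sync drive, "degraded" never reaches "raid_disks", so the check never
fires even when every drive is gone:

   #!/bin/sh
   # naive check: declare the array dead once no working members remain
   raid_disks=$(cat /sys/block/md0/md/raid_disks)
   degraded=$(cat /sys/block/md0/md/degraded)
   if [ "$degraded" -ge "$raid_disks" ]; then
       echo "md0 has no working members, initiating failover"
   fi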

I guess it depends on what kind of error occurred on the drive. If it's
just a media error we should keep it online as much as possible. But if
the drive is really bad or physically gone, keeping the stale reference
won't help anything. Back to your comparison with the single drive
/dev/sda: I think it is MD as an array that should behave the same as
/dev/sda, not the individual drives inside MD; those we should just let
go. What do you think?

Eric

On 2015-03-04 2:46 PM, NeilBrown wrote:
> On Wed, 04 Mar 2015 12:55:43 -0700 Eric Mei <meijia@gmail.com> wrote:
>
>> Hi,
>>
>> It is interesting to note that RAID1 won't mark the last working drive
>> as Faulty, no matter what. The responsible code seems to be here:
>>
>> static void error(struct mddev *mddev, struct md_rdev *rdev)
>> {
>>           ...
>>           /*
>>            * If it is not operational, then we have already marked it as dead
>>            * else if it is the last working disks, ignore the error, let the
>>            * next level up know.
>>            * else mark the drive as failed
>>            */
>>           if (test_bit(In_sync, &rdev->flags)
>>               && (conf->raid_disks - mddev->degraded) == 1) {
>>                   /*
>>                    * Don't fail the drive, act as though we were just a
>>                    * normal single drive.
>>                    * However don't try a recovery from this drive as
>>                    * it is very likely to fail.
>>                    */
>>                   conf->recovery_disabled = mddev->recovery_disabled;
>>                   return;
>>           }
>>           ...
>> }
>>
>> The end result is that even if all the drives are physically gone, one
>> drive still remains in the array forever, and mdadm continues to report
>> the array as degraded instead of failed. RAID10 has similar behavior.
>>
>> Is there any reason we absolutely don't want to fail the last drive of
>> RAID1?
>>
> When a RAID1 only has one drive remaining, then it should act as much as
> possible like a single plain ordinary drive.
>
> How does /dev/sda behave when you physically remove the device?  md0 (as a
> raid1 with one drive) should do the same.
>
> NeilBrown



* Re: Last working drive in RAID1
  2015-03-04 22:48   ` Eric Mei
@ 2015-03-04 23:26     ` NeilBrown
  2015-03-05 15:55       ` Wols Lists
  2015-03-05 20:23       ` Eric Mei
  0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2015-03-04 23:26 UTC (permalink / raw)
  To: Eric Mei; +Cc: linux-raid

On Wed, 04 Mar 2015 15:48:57 -0700 Eric Mei <meijia@gmail.com> wrote:

> Hi Neil,
> 
> I see, that does make sense. Thank you.
> 
> But it imposes a problem for HA. We have 2 nodes as an active-standby
> pair; if the HW on node 1 has a problem (e.g. a SAS cable gets pulled,
> so all access to the physical drives is gone), we hope the array fails
> over to node 2. But with the lingering drive reference, mdadm will
> report that the array is still alive, so failover won't happen.
> 
> I guess it depends on what kind of error occurred on the drive. If it's
> just a media error we should keep it online as much as possible. But if
> the drive is really bad or physically gone, keeping the stale reference
> won't help anything. Back to your comparison with the single drive
> /dev/sda: I think it is MD as an array that should behave the same as
> /dev/sda, not the individual drives inside MD; those we should just let
> go. What do you think?

If there were some way that md could be told that the device really was
gone and not just returning errors, then I would be OK with it being
marked as faulty and removed from the array.

I don't think there is any mechanism in the kernel to allow that.  It would
be easiest to capture a "REMOVE" event via udev, and have udev run "mdadm" to
tell the md array that the device was gone.

Currently there is no way to do that ... I guess we could change raid1 so
that a 'fail' event that came from user-space  would always cause the device
to be marked failed, even when an IO error would not...
To preserve current behaviour, it should require something like "faulty-force"
to be written to the "state" file.   We would need to check that raid1 copes
with having zero working drives - currently it might always assume there is
at least one device.
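
The udev side of that plumbing might look roughly like the following
(untested; the rule file name and the array are only examples, and
"faulty-force" does not exist yet):

   # e.g. /etc/udev/rules.d/99-md-fail-detached.rules
   # On removal of a whole disk, ask mdadm to fail and remove any members
   # of /dev/md0 that are no longer present.  With the current raid1 code
   # the last In_sync member would still be refused - that is where a
   # proposed "echo faulty-force > /sys/block/md0/md/dev-sdb/state"
   # (hypothetical) would come in.
   ACTION=="remove", SUBSYSTEM=="block", KERNEL=="sd?", RUN+="/sbin/mdadm /dev/md0 --fail detached --remove detached"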

NeilBrown



* Re: Last working drive in RAID1
  2015-03-04 23:26     ` NeilBrown
@ 2015-03-05 15:55       ` Wols Lists
  2015-03-05 19:54         ` Eric Mei
  2015-03-05 20:00         ` Phil Turmel
  2015-03-05 20:23       ` Eric Mei
  1 sibling, 2 replies; 11+ messages in thread
From: Wols Lists @ 2015-03-05 15:55 UTC (permalink / raw)
  To: NeilBrown, Eric Mei; +Cc: linux-raid

On 04/03/15 23:26, NeilBrown wrote:
> On Wed, 04 Mar 2015 15:48:57 -0700 Eric Mei <meijia@gmail.com>
> wrote:
> 
>> Hi Neil,
>> 
>> I see, that does make sense. Thank you.
>> 
>> But it imposes a problem for HA. We have 2 nodes as an active-standby
>> pair; if the HW on node 1 has a problem (e.g. a SAS cable gets pulled,
>> so all access to the physical drives is gone), we hope the array fails
>> over to node 2. But with the lingering drive reference, mdadm will
>> report that the array is still alive, so failover won't happen.
>> 
>> I guess it depends on what kind of error occurred on the drive. If it's
>> just a media error we should keep it online as much as possible. But if
>> the drive is really bad or physically gone, keeping the stale reference
>> won't help anything. Back to your comparison with the single drive
>> /dev/sda: I think it is MD as an array that should behave the same as
>> /dev/sda, not the individual drives inside MD; those we should just let
>> go. What do you think?
> 
> If there were some way that md could be told that the device really was
> gone and not just returning errors, then I would be OK with it being
> marked as faulty and removed from the array.
> 
> I don't think there is any mechanism in the kernel to allow that.
> It would be easiest to capture a "REMOVE" event via udev, and have
> udev run "mdadm" to tell the md array that the device was gone.
> 
> Currently there is no way to do that ... I guess we could change
> raid1 so that a 'fail' event that came from user-space  would
> always cause the device to be marked failed, even when an IO error
> would not... To preserve current behaviour, it should require
> something like "faulty-force" to be written to the "state" file.
> We would need to check that raid1 copes with having zero working
> drives - currently it might always assume there is at least one
> device.
> 
Sorry to butt in, but I'm finding this conversation a bit surreal ...
take everything I say with a pinch of salt. But the really weird bit
was "what does linux do if /dev/sda disappears?"

In the old days, with /dev/hd*, the * had a hard mapping to the
hardware. hda was the ide0 primary, hdd was the ide1 secondary, etc
etc. I think I ran several systems with just hdb and hdd. Not a good
idea, but.

Nowadays, with sd*, the letter is assigned in order of finding the
drive. So if sda is removed, linux moves all the other drives and what
was sdb becomes sda. Which is why you're advised now to always refer
to drives by their BLKDEV or whatever, as linux provides no guarantees
whatsoever about sd*. The blockdev may only be a symlink to whatever
the sd*n code of the disk is, but it makes sure you get the disk you
want when the sd*n changes under you.

Equally surreal is the comment about "what does raid1 do with no
working devices?". Surely it will do nothing, if there's no spinning
rust or whatever underneath it? You can't corrupt it if there's
nothing there to corrupt?

Sorry again if this is inappropriate, but you're coming over as so
buried in the trees that you can't see the wood.

Cheers,
Wol


* Re: Last working drive in RAID1
  2015-03-05 15:55       ` Wols Lists
@ 2015-03-05 19:54         ` Eric Mei
  2015-03-05 20:00         ` Phil Turmel
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Mei @ 2015-03-05 19:54 UTC (permalink / raw)
  To: Wols Lists, NeilBrown; +Cc: linux-raid

On 2015-03-05 8:55 AM, Wols Lists wrote:
> On 04/03/15 23:26, NeilBrown wrote:
>> On Wed, 04 Mar 2015 15:48:57 -0700 Eric Mei <meijia@gmail.com>
>> wrote:
>>
>>> Hi Neil,
>>>
>>> I see, that does make sense. Thank you.
>>>
>>> But it imposes a problem for HA. We have 2 nodes as an active-standby
>>> pair; if the HW on node 1 has a problem (e.g. a SAS cable gets pulled,
>>> so all access to the physical drives is gone), we hope the array fails
>>> over to node 2. But with the lingering drive reference, mdadm will
>>> report that the array is still alive, so failover won't happen.
>>>
>>> I guess it depends on what kind of error occurred on the drive. If it's
>>> just a media error we should keep it online as much as possible. But if
>>> the drive is really bad or physically gone, keeping the stale reference
>>> won't help anything. Back to your comparison with the single drive
>>> /dev/sda: I think it is MD as an array that should behave the same as
>>> /dev/sda, not the individual drives inside MD; those we should just let
>>> go. What do you think?
>> If there were some way that md could be told that the device really was
>> gone and not just returning errors, then I would be OK with it being
>> marked as faulty and removed from the array.
>>
>> I don't think there is any mechanism in the kernel to allow that.
>> It would be easiest to capture a "REMOVE" event via udev, and have
>> udev run "mdadm" to tell the md array that the device was gone.
>>
>> Currently there is no way to do that ... I guess we could change
>> raid1 so that a 'fail' event that came from user-space  would
>> always cause the device to be marked failed, even when an IO error
>> would not... To preserve current behaviour, it should require
>> something like "faulty-force" to be written to the "state" file.
>> We would need to check that raid1 copes with having zero working
>> drives - currently it might always assume there is at least one
>> device.
>>
> Sorry to butt in, but I'm finding this conversation a bit surreal ...
> take everything I say with a pinch of salt. But the really weird bit
> was "what does linux do if /dev/sda disappears?"
>
> In the old days, with /dev/hd*, the * had a hard mapping to the
> hardware. hda was the ide0 primary, hdd was the ide1 secondary, etc
> etc. I think I ran several systems with just hdb and hdd. Not a good
> idea, but.
>
> Nowadays, with sd*, the letter is assigned in order of finding the
> drive. So if sda is removed, linux moves all the other drives and what
> was sdb becomes sda. Which is why you're advised now to always refer
> to drives by their BLKDEV or whatever, as linux provides no guarantees
> whatsoever about sd*. The blockdev may only be a symlink to whatever
> the sd*n code of the disk is, but it makes sure you get the disk you
> want when the sd*n changes under you.
>
> Equally surreal is the comment about "what does raid1 do with no
> working devices?". Surely it will do nothing, if there's no spinning
> rust or whatever underneath it? You can't corrupt it if there's
> nothing there to corrupt?
>
> Sorry again if this is inappropriate, but you're coming over as so
> buried in the trees that you can't see the wood.
>
> Cheers,
> Wol
Hi Wol,

I think Neil's intention regarding "/dev/sda" was like this: for a
single drive, if it's physically gone, its user will still keep a
reference to it; and a RAID1 with a single drive should behave the same
way, i.e. without knowledge of what exactly happened to this drive, MD
is not comfortable making a decision for the application about the
status of the last drive, and thus refuses to mark it as failed. The
whole thing doesn't have much to do with the FS arrangement under /dev,
which is usually managed by udev.

Eric


* Re: Last working drive in RAID1
  2015-03-05 15:55       ` Wols Lists
  2015-03-05 19:54         ` Eric Mei
@ 2015-03-05 20:00         ` Phil Turmel
  2015-03-05 21:52           ` NeilBrown
  2015-03-05 21:54           ` Chris
  1 sibling, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2015-03-05 20:00 UTC (permalink / raw)
  To: Wols Lists, NeilBrown, Eric Mei; +Cc: linux-raid

On 03/05/2015 10:55 AM, Wols Lists wrote:
> Sorry to butt in, but I'm finding this conversation a bit surreal ...
> take everything I say with a pinch of salt. But the really weird bit
> was "what does linux do if /dev/sda disappears?"
> 
> In the old days, with /dev/hd*, the * had a hard mapping to the
> hardware. hda was the ide0 primary, hdd was the ide1 secondary, etc
> etc. I think I ran several systems with just hdb and hdd. Not a good
> idea, but.
> 
> Nowadays, with sd*, the letter is assigned in order of finding the
> drive. So if sda is removed, linux moves all the other drives and what
> was sdb becomes sda.

On reboot, yes.  Not live.  If /dev/sda has anything using it when
unplugged, it stays there and gives errors.  Existing devices retain
their names, and new devices are added to the end.  Only if all users
are completely unhooked will a kernel name get re-used live.

> Which is why you're advised now to always refer
> to drives by their BLKDEV or whatever, as linux provides no guarantees
> whatsoever about sd*. The blockdev may only be a symlink to whatever
> the sd*n code of the disk is, but it makes sure you get the disk you
> want when the sd*n changes under you.

Correct, but once assembled or mounted, the kernel names are locked.

> Equally surreal is the comment about "what does raid1 do with no
> working devices?". Surely it will do nothing, if there's no spinning
> rust or whatever underneath it? You can't corrupt it if there's
> nothing there to corrupt?

It has to stay there to give errors to the upper layers that are still
hooked to it, until they are administratively "unhooked", aka unmounted
or disassociated with mdadm --remove.

Or, quite possibly, the device is plugged back in, at which point the
device name is there for it (as long as you use the same port, of
course).  In which case the filesystem may very well resume successfully.

> Sorry again if this is inappropriate, but you're coming over as so
> buried in the trees that you can't see the wood.

Sometimes the expanded view is appropriate :-)

Regards,

Phil Turmel



* Re: Last working drive in RAID1
  2015-03-04 23:26     ` NeilBrown
  2015-03-05 15:55       ` Wols Lists
@ 2015-03-05 20:23       ` Eric Mei
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Mei @ 2015-03-05 20:23 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 2015-03-04 4:26 PM, NeilBrown wrote:
> On Wed, 04 Mar 2015 15:48:57 -0700 Eric Mei <meijia@gmail.com> wrote:
>
>> Hi Neil,
>>
>> I see, that does make sense. Thank you.
>>
>> But it imposes a problem for HA. We have 2 nodes as an active-standby
>> pair; if the HW on node 1 has a problem (e.g. a SAS cable gets pulled,
>> so all access to the physical drives is gone), we hope the array fails
>> over to node 2. But with the lingering drive reference, mdadm will
>> report that the array is still alive, so failover won't happen.
>>
>> I guess it depends on what kind of error occurred on the drive. If it's
>> just a media error we should keep it online as much as possible. But if
>> the drive is really bad or physically gone, keeping the stale reference
>> won't help anything. Back to your comparison with the single drive
>> /dev/sda: I think it is MD as an array that should behave the same as
>> /dev/sda, not the individual drives inside MD; those we should just let
>> go. What do you think?
> If there were some way that md could be told that the device really was
> gone and not just returning errors, then I would be OK with it being
> marked as faulty and removed from the array.
>
> I don't think there is any mechanism in the kernel to allow that.  It would
> be easiest to capture a "REMOVE" event via udev, and have udev run "mdadm" to
> tell the md array that the device was gone.
>
> Currently there is no way to do that ... I guess we could change raid1 so
> that a 'fail' event that came from user-space  would always cause the device
> to be marked failed, even when an IO error would not...
> To preserve current behaviour, it should require something like "faulty-force"
> to be written to the "state" file.   We would need to check that raid1 copes
> with having zero working drives - currently it might always assume there is
> at least one device.
I guess we don't need to know exactly what happened physically; it
should be good enough to know that the drive stopped working. If a drive
stopped working, keeping it doesn't add much value anyway. And I think a
serious error detected in MD (e.g. a superblock write error or a bad
block table write error) might be a good criterion for making that
judgement.

But as you said, the current code may assume at least one drive is
present, so it needs a more careful review.

Eric


* Re: Last working drive in RAID1
  2015-03-05 20:00         ` Phil Turmel
@ 2015-03-05 21:52           ` NeilBrown
  2015-03-06  9:21             ` Chris
  2015-03-05 21:54           ` Chris
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2015-03-05 21:52 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, Eric Mei, linux-raid

On Thu, 05 Mar 2015 15:00:18 -0500 Phil Turmel <philip@turmel.org> wrote:

> On 03/05/2015 10:55 AM, Wols Lists wrote:
> > Sorry to butt in, but I'm finding this conversation a bit surreal ...
> > take everything I say with a pinch of salt. But the really weird bit
> > was "what does linux do if /dev/sda disappears?"
> > 
> > In the old days, with /dev/hd*, the * had a hard mapping to the
> > hardware. hda was the ide0 primary, hdd was the ide1 secondary, etc
> > etc. I think I ran several systems with just hdb and hdd. Not a good
> > idea, but.
> > 
> > Nowadays, with sd*, the letter is assigned in order of finding the
> > drive. So if sda is removed, linux moves all the other drives and what
> > was sdb becomes sda.
> 
> On reboot, yes.  Not live.  If /dev/sda has anything using it when
> unplugged, it stays there and gives errors.  Existing devices retain
> their names, and new devices are added to the end.  Only if all users
> are completely unhooked will a kernel name get re-used live.
> 
> > Which is why you're advised now to always refer
> > to drives by their BLKDEV or whatever, as linux provides no guarantees
> > whatsoever about sd*. The blockdev may only be a symlink to whatever
> > the sd*n code of the disk is, but it makes sure you get the disk you
> > want when the sd*n changes under you.
> 
> Correct, but once assembled or mounted, the kernel names are locked.
> 
> > Equally surreal is the comment about "what does raid1 do with no
> > working devices?". Surely it will do nothing, if there's no spinning
> > rust or whatever underneath it? You can't corrupt it if there's
> > nothing there to corrupt?
> 
> It has to stay there to give errors to the upper layers that are still
> hooked to it, until they are administratively "unhooked", aka unmounted
> or disassociated with mdadm --remove.
> 
> Or, quite possibly, the device is plugged back in, at which point the
> device name is there for it (as long as you use the same port, of
> course).  In which case the filesystem may very well resume successfully.

I was with you right up to this last point.
When a device is unplugged and then plugged back in, it will always get a new
name.  Detecting "is the same device" is far from fool-proof, particularly
as the device could have been plugged into some other machine and had
'fsck' etc. run on it.
Once a mounted device is unplugged, that mount is permanently unusable.

NeilBrown


> 
> > Sorry again if this is inappropriate, but you're coming over as so
> > buried in the trees that you can't see the wood.
> 
> Sometimes the expanded view is appropriate :-)
> 
> Regards,
> 
> Phil Turmel



* Re: Last working drive in RAID1
  2015-03-05 20:00         ` Phil Turmel
  2015-03-05 21:52           ` NeilBrown
@ 2015-03-05 21:54           ` Chris
  1 sibling, 0 replies; 11+ messages in thread
From: Chris @ 2015-03-05 21:54 UTC (permalink / raw)
  To: linux-raid

Am Thu, 05 Mar 2015 15:00:18 -0500
schrieb Phil Turmel <philip@turmel.org>:

> It has to stay there to give errors to the upper layers that are still
> hooked to it, until they are administratively "unhooked", aka
> unmounted or disassociated with mdadm --remove.
> 
> Or, quite possibly, the device is plugged back in, at which point the
> device name is there for it (as long as you use the same port, of
> course).  In which case the filesystem may very well resume
> successfully.

From reading this it makes sense that the md device stays there, just
as the physical device nodes do (to give errors, and to recover).

However, as I understood this thread, md does not seem to inform upper
layers or the user (not even through its own --monitor?) properly.
To me, marking the last disk within an array as failed (*within* the
array) just seems to make more sense, so that /proc/mdstat actually
reports the md error state (and the md device returning errors on
access).

Regards,
Chris


* Re: Last working drive in RAID1
  2015-03-05 21:52           ` NeilBrown
@ 2015-03-06  9:21             ` Chris
  0 siblings, 0 replies; 11+ messages in thread
From: Chris @ 2015-03-06  9:21 UTC (permalink / raw)
  To: linux-raid

Am Fri, 6 Mar 2015 08:52:22 +1100
schrieb NeilBrown <neilb@suse.de>:

> > Or, quite possibly, the device is plugged back in, at which point
> > the device name is there for it (as long as you use the same port,
> > of course).  In which case the filesystem may very well resume
> > successfully.
> 
> I was with you right up to this last point.
> When a device is unplugged and then plugged back in, it will always
> get a new name.

Right, that is what I see when "rotating" disks or connecting to a
docking station. And I did also see the last disk of an external
storage RAID not going away (being failed) when unplugged.

The times I had devices really break, though, I don't think they
triggered an unplug event. They still seemed fully connected but had
motor-start or bus errors or something. On some occasions the failure
was only intermittent, and the device node continued to work again
(within the controller timeout, or without a permanent error remaining
after a reset).


So if the last working drive could be marked as failed when it
actually failed, that would also provide the proper information for
system failover on replicated hosts in the many cases where a
controller/bus/drive fails without an unplug event. Those are cases
that the udev-rule-only idea does not seem to cover.

Regards,
Chris

