* Wish list : automatic rebuild with hot swap osd ?
@ 2017-10-17 21:06 Yoann Moulin
  2017-10-17 21:22 ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Yoann Moulin @ 2017-10-17 21:06 UTC (permalink / raw)
  To: ceph-devel

Hello,

I wonder if it would be possible to add to Ceph the ability to automatically rebuild an OSD on a freshly added disk that replaces a failed OSD
disk in the same slot.

I imagine something like adding a flag on disks (identified by-path, for example, or by any other deterministic way to address the device) so
that a disk detected with no data, in a slot previously marked as failed, is auto-reconfigured. Only disks identified by-path, with the
auto-rebuild flag enabled and whose slot was previously marked as failed, would be eligible for auto-reconfiguration.

I know it will take some time to find the best way to implement this feature without risking zapping a disk that still holds data, but it would
be a great improvement to maintainability.

Thanks,

-- 
Yoann Moulin
EPFL IC-IT


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-17 21:06 Wish list : automatic rebuild with hot swap osd ? Yoann Moulin
@ 2017-10-17 21:22 ` Sage Weil
  2017-10-18 16:02   ` alan somers
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2017-10-17 21:22 UTC (permalink / raw)
  To: Yoann Moulin; +Cc: ceph-devel

On Tue, 17 Oct 2017, Yoann Moulin wrote:
> Hello,
> 
> I wonder if it's possible to add to ceph the ability to rebuild a new disk automatically freshly added to a slot in replacement of a failed osd
> disk.
> 
> I imagine something like adding a flag on disks (identified by-path for example or a way to have a deterministic access to the device) to be
> auto-reconfigured if detected with no data after the same slot was marked as failed. Only disks identified "by-path" with an auto-rebuild flag
> activated, and previously marked as failed, can run the auto-reconfiguration.
> 
> I know it should take some time to find the best way to implement that feature to avoid zapping a disk with data but that would be great to
> improve maintainability.

The way we approached this before with ceph-disk was that you would 
prelabel replacement disks as "blank ceph" or similar.  That way, if we saw 
a ceph-labeled disk that hadn't been used yet, we would know it was fair 
game.  This is a lot more work for the admin (you have to attach each of 
your replacement disks to mark them and then put them in the replacement 
pile), but it is safer.

An alternative model would be to have a reasonably reliable way to 
identify a fresh disk from the factory.  I'm honestly not sure what new 
disks look like these days (zeros? an empty NTFS partition?), but we could 
try to recognize "blank" and go from there.

Unfortunately we can't assume blank if we see garbage because the disk 
might be encrypted.

Anyway, assuming that part was sorted out, I think a complete solution 
would query the mons to see what OSD id used to live in that particular 
by-path location/slot and try to re-use that OSD ID.  We have still built 
in a manual step here of marking the failed OSD "destroyed", since reusing 
the ID means the cluster may assume the PG copies on that OSD are lost.
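
Very roughly, the by-path -> OSD id lookup could be driven from the 
existing CLI.  A Python sketch only; the "device_paths" metadata key is 
an assumption here, since the metadata reporting isn't standardized and 
a real tool would have to normalize whatever the OSDs report:

  import json
  import subprocess

  def osd_id_for_slot(by_path):
      # 'ceph osd metadata' with no id returns metadata for every OSD
      out = subprocess.check_output(
          ['ceph', 'osd', 'metadata', '--format', 'json'])
      for osd in json.loads(out):
          # hypothetical key; bluestore/filestore report devices differently
          if by_path in osd.get('device_paths', ''):
              return osd['id']
      return None

  def is_down(osd_id):
      out = subprocess.check_output(
          ['ceph', 'osd', 'dump', '--format', 'json'])
      for osd in json.loads(out)['osds']:
          if osd['osd'] == osd_id:
              return osd['up'] == 0
      return False

Reprovisioning would then just be the existing destroy/recreate flow 
(mark the failed OSD destroyed, recreate an OSD with the same ID), driven 
automatically instead of by hand.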

sage


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-17 21:22 ` Sage Weil
@ 2017-10-18 16:02   ` alan somers
  2017-10-18 16:25     ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: alan somers @ 2017-10-18 16:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yoann Moulin, ceph-devel

On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 17 Oct 2017, Yoann Moulin wrote:
>> Hello,
>>
>> I wonder if it's possible to add to ceph the ability to rebuild a new disk automatically freshly added to a slot in replacement of a failed osd
>> disk.
>>
>> I imagine something like adding a flag on disks (identified by-path for example or a way to have a deterministic access to the device) to be
>> auto-reconfigured if detected with no data after the same slot was marked as failed. Only disks identified "by-path" with an auto-rebuild flag
>> activated, and previously marked as failed, can run the auto-reconfiguration.
>>
>> I know it should take some time to find the best way to implement that feature to avoid zapping a disk with data but that would be great to
>> improve maintainability.
>
> The way that we approached this before with ceph-disk was that you would
> prelabel replacement disks as "blank ceph" or similar.  That way if we saw
> a ceph labeled disk that hadn't been used yet we would know it was fair
> game.  This is a lot more work for the admin (you have to attached each of
> your replacement disks to mark them and then put them in the replacement
> pile) but it is safer.
>
> An alternative model would be to have a reasonably reliable way to
> identify a fresh disk from the factory.  I'm honest not sure what new
> disks look like these days (zero? empty NTFS partition?) but we could try
> to recognized "blank" and go from there.
>
> Unfortunately we can't assume blank if we see garbage because the disk
> might be encrypted.
>
> Anyway, assuming that part was sorted out, I think a complete solution
> would query the mons to see what OSD id used to live in that particular
> by-path location/slot and try to re-use that OSD ID.  We have still built
> in a manual task here of marking the failed osd "destroyed" since reusing
> the ID means the cluster may make assumptions about PG copies on the OSD
> being lost.
>
> sage

ZFS handles this with the "autoreplace" pool flag.  If that flag is
set, then a blank drive inserted into a slot formerly occupied by a
member of the pool will automatically be used to replace that former
member.  The nice thing about this scheme is that it requires no
per-drive effort on the part of the sysadmin, and it doesn't touch
drives inserted into other slots.  The less nice thing is that it only
works with SES expanders that provide slot information.  I don't like
Sage's second suggestion because it basically takes over the entire
server.  If newly inserted blank drives are instantly gobbled up by
Ceph, then they can't be used by anything else.  IMHO that kind of
greedy functionality shouldn't be built into something as general as
Ceph (though perhaps it could go in a separately installed daemon).

-Alan


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-18 16:02   ` alan somers
@ 2017-10-18 16:25     ` Sage Weil
  2017-10-19 12:14       ` Alfredo Deza
  2017-10-19 12:30       ` John Spray
  0 siblings, 2 replies; 9+ messages in thread
From: Sage Weil @ 2017-10-18 16:25 UTC (permalink / raw)
  To: alan somers; +Cc: Yoann Moulin, ceph-devel, jspray

On Wed, 18 Oct 2017, alan somers wrote:
> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
> >> Hello,
> >>
> >> I wonder if it's possible to add to ceph the ability to rebuild a new disk automatically freshly added to a slot in replacement of a failed osd
> >> disk.
> >>
> >> I imagine something like adding a flag on disks (identified by-path for example or a way to have a deterministic access to the device) to be
> >> auto-reconfigured if detected with no data after the same slot was marked as failed. Only disks identified "by-path" with an auto-rebuild flag
> >> activated, and previously marked as failed, can run the auto-reconfiguration.
> >>
> >> I know it should take some time to find the best way to implement that feature to avoid zapping a disk with data but that would be great to
> >> improve maintainability.
> >
> > The way that we approached this before with ceph-disk was that you would
> > prelabel replacement disks as "blank ceph" or similar.  That way if we saw
> > a ceph labeled disk that hadn't been used yet we would know it was fair
> > game.  This is a lot more work for the admin (you have to attached each of
> > your replacement disks to mark them and then put them in the replacement
> > pile) but it is safer.
> >
> > An alternative model would be to have a reasonably reliable way to
> > identify a fresh disk from the factory.  I'm honest not sure what new
> > disks look like these days (zero? empty NTFS partition?) but we could try
> > to recognized "blank" and go from there.
> >
> > Unfortunately we can't assume blank if we see garbage because the disk
> > might be encrypted.
> >
> > Anyway, assuming that part was sorted out, I think a complete solution
> > would query the mons to see what OSD id used to live in that particular
> > by-path location/slot and try to re-use that OSD ID.  We have still built
> > in a manual task here of marking the failed osd "destroyed" since reusing
> > the ID means the cluster may make assumptions about PG copies on the OSD
> > being lost.
> >
> > sage
> 
> ZFS handles this with the "autoreplace" pool flag.  If that flag is
> set, then a blank drive inserted into a slot formerly occupied by a
> member of the pool will automatically be used to replace that former
> member.  The nice thing about this scheme is that it requires no
> per-drive effort on the part of the sysadmin, and it doesn't touch
> drives inserted into other slots.  The less nice thing is that it only
> works with SES expanders that provide slot information.  I don't like
> Sage's second suggestion because it basically takes over the entire
> server.  If newly inserted blank drives are instantly gobbled up by
> Ceph, then they can't be used by anything else.  IMHO that kind of
> greedy functionality shouldn't be builtin to something as general as
> Ceph (though perhaps it could go in a separately installed daemon).

I think in our case it would be a separate daemon either way that is 
monitoring slots and reprovisioning OSDs when appropriate.  I like this 
approach!  I think it would involve:

- A cluster option, not a pool one, since Ceph pools are about 
logical data collections, not hardware.

- A clearer mapping between OSDs and the block device(s) they consume, and 
some additional metadata on those devices (in this case, the 
/dev/disk/by-path string ought to suffice).  I know John has done some 
work here but I think things are still a bit ad hoc.  For example, the osd 
metadata reporting for bluestore devices is pretty unstructured.  (We also 
want a clearer device list, with per-device properties, for the SMART 
data reporting.)

- A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a 
tool that is triggered by udev.  It would check for new, empty devices 
appearing in the locations (as defined by the by-path string) previously 
occupied by OSDs that are down.  If that happens, it can use 'ceph osd 
safe-to-destroy' to verify whether it is safe to automatically rebuild 
that OSD.  (If not, it might want to raise a health alert, since it's 
possible the drive that was physically pulled should be preserved until 
the cluster is sure it doesn't need it.)
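
A very rough sketch of the per-device check such a daemon (or udev hook) 
might run.  The blank test here is deliberately naive (first 4 MB all 
zeros), and as noted above, "garbage" could be an encrypted disk, so 
anything non-blank is left alone:

  import subprocess

  def looks_blank(dev, nbytes=4 * 1024 * 1024):
      with open(dev, 'rb') as f:
          return not any(bytearray(f.read(nbytes)))

  def safe_to_destroy(osd_id):
      # 'ceph osd safe-to-destroy' fails if PG copies could be lost
      return subprocess.call(
          ['ceph', 'osd', 'safe-to-destroy', str(osd_id)]) == 0

  def handle_new_device(dev, osd_id):
      if not looks_blank(dev):
          print('%s is not blank, leaving it alone' % dev)
      elif not safe_to_destroy(osd_id):
          print('osd.%d not yet safe to destroy; raise a health alert'
                % osd_id)
      else:
          print('would destroy osd.%d and reprovision it on %s'
                % (osd_id, dev))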

?

sage



* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-18 16:25     ` Sage Weil
@ 2017-10-19 12:14       ` Alfredo Deza
  2017-10-19 13:08         ` Willem Jan Withagen
  2017-10-19 12:30       ` John Spray
  1 sibling, 1 reply; 9+ messages in thread
From: Alfredo Deza @ 2017-10-19 12:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: alan somers, Yoann Moulin, ceph-devel, John Spray

On Wed, Oct 18, 2017 at 12:25 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 18 Oct 2017, alan somers wrote:
>> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
>> >> Hello,
>> >>
>> >> I wonder if it's possible to add to ceph the ability to rebuild a new disk automatically freshly added to a slot in replacement of a failed osd
>> >> disk.
>> >>
>> >> I imagine something like adding a flag on disks (identified by-path for example or a way to have a deterministic access to the device) to be
>> >> auto-reconfigured if detected with no data after the same slot was marked as failed. Only disks identified "by-path" with an auto-rebuild flag
>> >> activated, and previously marked as failed, can run the auto-reconfiguration.
>> >>
>> >> I know it should take some time to find the best way to implement that feature to avoid zapping a disk with data but that would be great to
>> >> improve maintainability.
>> >
>> > The way that we approached this before with ceph-disk was that you would
>> > prelabel replacement disks as "blank ceph" or similar.  That way if we saw
>> > a ceph labeled disk that hadn't been used yet we would know it was fair
>> > game.  This is a lot more work for the admin (you have to attached each of
>> > your replacement disks to mark them and then put them in the replacement
>> > pile) but it is safer.
>> >
>> > An alternative model would be to have a reasonably reliable way to
>> > identify a fresh disk from the factory.  I'm honest not sure what new
>> > disks look like these days (zero? empty NTFS partition?) but we could try
>> > to recognized "blank" and go from there.
>> >
>> > Unfortunately we can't assume blank if we see garbage because the disk
>> > might be encrypted.
>> >
>> > Anyway, assuming that part was sorted out, I think a complete solution
>> > would query the mons to see what OSD id used to live in that particular
>> > by-path location/slot and try to re-use that OSD ID.  We have still built
>> > in a manual task here of marking the failed osd "destroyed" since reusing
>> > the ID means the cluster may make assumptions about PG copies on the OSD
>> > being lost.
>> >
>> > sage
>>
>> ZFS handles this with the "autoreplace" pool flag.  If that flag is
>> set, then a blank drive inserted into a slot formerly occupied by a
>> member of the pool will automatically be used to replace that former
>> member.  The nice thing about this scheme is that it requires no
>> per-drive effort on the part of the sysadmin, and it doesn't touch
>> drives inserted into other slots.  The less nice thing is that it only
>> works with SES expanders that provide slot information.  I don't like
>> Sage's second suggestion because it basically takes over the entire
>> server.  If newly inserted blank drives are instantly gobbled up by
>> Ceph, then they can't be used by anything else.  IMHO that kind of
>> greedy functionality shouldn't be builtin to something as general as
>> Ceph (though perhaps it could go in a separately installed daemon).
>
> I think in our case it would be a separate daemon either way that is
> monitoring slots and reprovisioning OSDs when appropriate.  I like this
> approach!  I think it would involve:
>
> - A cluster option, not a pool one, since Ceph pools are about
> logical data collections, not hardware.
>
> - A clearer mapping between OSDs and the block device(s) they consume, and
> some additional metadata on those devices (in this case, the
> /dev/disk/by-path string ought to suffice).

I am hesitant to rely on by-path here; those can change if devices
change ports. While testing ceph-volume to make it resilient to device
name changes, we weren't able to rely on by-path. This is easy to test
on a VM by changing the port number where the disk is plugged in.

Unless we are describing a scenario where the ports never change, the
system never reboots, and it is always a straight bad-one-out /
good-one-in swap?
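
For comparison, it's easy to dump the persistent names udev creates for
a device and see how by-path and by-id (WWN/serial-based) links differ
on a given box; a quick sketch:

  import os

  def persistent_names(dev):                 # dev like '/dev/sdb'
      names = {}
      for sub in ('by-path', 'by-id'):
          d = os.path.join('/dev/disk', sub)
          if not os.path.isdir(d):
              continue
          names[sub] = [
              os.path.join(d, n) for n in os.listdir(d)
              if os.path.realpath(os.path.join(d, n)) == dev
          ]
      return names

  # e.g. print(persistent_names('/dev/sdb'))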

> I know John has done some
> work here but I think things are still a bit ad hoc.  For example, the osd
> metadata reporting for bluestore devices is pretty unstructured.  (We also
> want a clearer device list and properties for devices for the SMART
> data reporting.)

If these properties also include device information (vendor, size,
solid/rotational, etc.), it could help to better map/detect an OSD
replacement, since clusters tend to have a certain level of hardware
homogeneity: if $brand and $size and $rotational match, etc.

>
> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
> tool that is triggered by udev.  It would check for new, empty devices
> appearing in the locations (as defined by the by-path string) previously
> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
> safe-to-destroy' to verify whether it is safe to automatically rebuild
> that OSD.  (If not, it might want to raise a health alert, since it's
> possible the drive that was physically pulled should be preserved until
> the cluster is sure it doesn't need it.)

systemd has some support for devices, so we might not even need a
daemon, but rather a unit that can depend on events already handled by
systemd (which would save us from udev).

>
> ?
>
> sage
>


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-18 16:25     ` Sage Weil
  2017-10-19 12:14       ` Alfredo Deza
@ 2017-10-19 12:30       ` John Spray
  1 sibling, 0 replies; 9+ messages in thread
From: John Spray @ 2017-10-19 12:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: alan somers, Yoann Moulin, ceph-devel

On Wed, Oct 18, 2017 at 5:25 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 18 Oct 2017, alan somers wrote:
>> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
>> >> Hello,
>> >>
>> >> I wonder if it's possible to add to ceph the ability to rebuild a new disk automatically freshly added to a slot in replacement of a failed osd
>> >> disk.
>> >>
>> >> I imagine something like adding a flag on disks (identified by-path for example or a way to have a deterministic access to the device) to be
>> >> auto-reconfigured if detected with no data after the same slot was marked as failed. Only disks identified "by-path" with an auto-rebuild flag
>> >> activated, and previously marked as failed, can run the auto-reconfiguration.
>> >>
>> >> I know it should take some time to find the best way to implement that feature to avoid zapping a disk with data but that would be great to
>> >> improve maintainability.
>> >
>> > The way that we approached this before with ceph-disk was that you would
>> > prelabel replacement disks as "blank ceph" or similar.  That way if we saw
>> > a ceph labeled disk that hadn't been used yet we would know it was fair
>> > game.  This is a lot more work for the admin (you have to attached each of
>> > your replacement disks to mark them and then put them in the replacement
>> > pile) but it is safer.
>> >
>> > An alternative model would be to have a reasonably reliable way to
>> > identify a fresh disk from the factory.  I'm honest not sure what new
>> > disks look like these days (zero? empty NTFS partition?) but we could try
>> > to recognized "blank" and go from there.
>> >
>> > Unfortunately we can't assume blank if we see garbage because the disk
>> > might be encrypted.
>> >
>> > Anyway, assuming that part was sorted out, I think a complete solution
>> > would query the mons to see what OSD id used to live in that particular
>> > by-path location/slot and try to re-use that OSD ID.  We have still built
>> > in a manual task here of marking the failed osd "destroyed" since reusing
>> > the ID means the cluster may make assumptions about PG copies on the OSD
>> > being lost.
>> >
>> > sage
>>
>> ZFS handles this with the "autoreplace" pool flag.  If that flag is
>> set, then a blank drive inserted into a slot formerly occupied by a
>> member of the pool will automatically be used to replace that former
>> member.  The nice thing about this scheme is that it requires no
>> per-drive effort on the part of the sysadmin, and it doesn't touch
>> drives inserted into other slots.  The less nice thing is that it only
>> works with SES expanders that provide slot information.  I don't like
>> Sage's second suggestion because it basically takes over the entire
>> server.  If newly inserted blank drives are instantly gobbled up by
>> Ceph, then they can't be used by anything else.  IMHO that kind of
>> greedy functionality shouldn't be builtin to something as general as
>> Ceph (though perhaps it could go in a separately installed daemon).
>
> I think in our case it would be a separate daemon either way that is
> monitoring slots and reprovisioning OSDs when appropriate.  I like this
> approach!  I think it would involve:
>
> - A cluster option, not a pool one, since Ceph pools are about
> logical data collections, not hardware.
>
> - A clearer mapping between OSDs and the block device(s) they consume, and
> some additional metadata on those devices (in this case, the
> /dev/disk/by-path string ought to suffice).  I know John has done some
> work here but I think things are still a bit ad hoc.  For example, the osd
> metadata reporting for bluestore devices is pretty unstructured.  (We also
> want a clearer device list and properties for devices for the SMART
> data reporting.)
>
> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
> tool that is triggered by udev.  It would check for new, empty devices
> appearing in the locations (as defined by the by-path string) previously
> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
> safe-to-destroy' to verify whether it is safe to automatically rebuild
> that OSD.  (If not, it might want to raise a health alert, since it's
> possible the drive that was physically pulled should be preserved until
> the cluster is sure it doesn't need it.)

It's a neat idea... I'm trying to get over my instinctive discomfort
with tools that format drives without administrator intervention!

At the point where we have a daemon detecting new devices, it might be
better to have it report those devices to something central, so that we
can prompt the user with "I see a new drive that looks like a
replacement, shall we go for it?", and the user can either say yes or
flag that the drive should be ignored by Ceph.

Building such a daemon feels like quite a significant step: once it's
there it would be awfully tempting to use it for other things.  It
depends whether we want to own that piece, or whether we would rather
hold out for container environments that can report drives to us and
thereby avoid the need for our own drive detecting daemon.

John

>
> ?
>
> sage
>


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-19 12:14       ` Alfredo Deza
@ 2017-10-19 13:08         ` Willem Jan Withagen
  2017-10-19 14:53           ` Alan Somers
  0 siblings, 1 reply; 9+ messages in thread
From: Willem Jan Withagen @ 2017-10-19 13:08 UTC (permalink / raw)
  To: Alfredo Deza, Sage Weil; +Cc: alan somers, Yoann Moulin, ceph-devel, John Spray

On 19-10-2017 14:14, Alfredo Deza wrote:
> If these properties also mean device information (vendor, size,
> solid/rotational, etc...) it could help
> to better map/detect an OSD replacement since clusters tend to have a
> certain level of
> homogeneous hardware: if $brand, and $size, and $rotational etc...
> 
>>
>> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
>> tool that is triggered by udev.  It would check for new, empty devices
>> appearing in the locations (as defined by the by-path string) previously
>> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
>> safe-to-destroy' to verify whether it is safe to automatically rebuild
>> that OSD.  (If not, it might want to raise a health alert, since it's
>> possible the drive that was physically pulled should be preserved until
>> the cluster is sure it doesn't need it.)
> 
> systemd has some support for devices, so we might not even need a
> daemon, but more a unit that can
> depend on events already handled by systemd (would save us from udev).

FreeBSD does not have systemd. 8-)

I'm inclined to say "luckily", but that may be my personal bias.
I don't like "automagic" tools like udev or systemd tinkering with my 
disks.

As Alan says, in ZFS one can designate a hot standby. But even there I 
prefer to be alerted and then intervene manually.

A hot-swap daemon that is instructed to use only explicitly and fully 
enumerated disks might be something to trust. So something matching 
the disk serial number would be OK.

--WjW


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-19 13:08         ` Willem Jan Withagen
@ 2017-10-19 14:53           ` Alan Somers
  2017-10-20 16:58             ` Yoann Moulin
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Somers @ 2017-10-19 14:53 UTC (permalink / raw)
  To: Willem Jan Withagen
  Cc: Alfredo Deza, Sage Weil, Yoann Moulin, ceph-devel, John Spray

On Thu, Oct 19, 2017 at 7:08 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> On 19-10-2017 14:14, Alfredo Deza wrote:
>>
>> If these properties also mean device information (vendor, size,
>> solid/rotational, etc...) it could help
>> to better map/detect an OSD replacement since clusters tend to have a
>> certain level of
>> homogeneous hardware: if $brand, and $size, and $rotational etc...
>>
>>>
>>> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
>>> tool that is triggered by udev.  It would check for new, empty devices
>>> appearing in the locations (as defined by the by-path string) previously
>>> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
>>> safe-to-destroy' to verify whether it is safe to automatically rebuild
>>> that OSD.  (If not, it might want to raise a health alert, since it's
>>> possible the drive that was physically pulled should be preserved until
>>> the cluster is sure it doesn't need it.)
>>
>>
>> systemd has some support for devices, so we might not even need a
>> daemon, but more a unit that can
>> depend on events already handled by systemd (would save us from udev).
>
>
> FreeBSD does not have systemd. 8-)
>
> I'm inclined to say luckily, but then that may be my personal bias.
> I don't like "automagic" tools like Udev or systemd tinkering with my disks.
>
> As Alan says, in ZFS one can designate hot-standby. But even there I prefer
> to be alerted and then manually intervene.

Actually, I was talking about autoreplace by physical path.  Hot
spares are something else.  The physical path of a drive is distinct
from its device path.  The physical path is determined by information
from a SES[1] expander, which can actually tell you which physical
slots contain which logical drives.

>
> A hot-swap daemon that gets instructed to only use explicitly and fully
> enumerated disk might be something to trust. So something matching
> disk-serial number would be oke.

Matching disk serial number isn't always safe in a VM.  VMs can
generate duplicate serial numbers.  Better to match against a GPT
label or something that identifies a drive as belonging to Ceph.
That, unfortunately, requires some intervention from the
administrator.  The nice thing about a user space daemon is that its
behavior can easily be controlled by the sysadmin.  So for example, a
sysadmin could opt into a rule that says "Ceph can take over all SCSI
disks" or "Ceph can take over all disks without an existing partition
table or known filesystem".
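
For that last rule, a minimal sketch of what "no existing partition
table or known filesystem" could check (hand-rolled magic numbers,
assuming 512-byte sectors for the GPT header; a real implementation
would lean on libblkid instead):

  def has_known_signature(dev):
      with open(dev, 'rb') as f:
          head = f.read(2048)
      return any([
          head[510:512] == b'\x55\xaa',    # MBR / protective MBR signature
          head[512:520] == b'EFI PART',    # GPT header at LBA 1
          head[0:4] == b'XFSB',            # XFS superblock magic
          head[1080:1082] == b'\x53\xef',  # ext2/3/4 magic (0xEF53, LE)
      ])

  # A drive would only be eligible for takeover if this returns False
  # (and even then, all-random content could still be someone's
  # dm-crypt volume, as noted earlier in the thread).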

-Alan

[1] https://en.wikipedia.org/wiki/SCSI_Enclosure_Services


* Re: Wish list : automatic rebuild with hot swap osd ?
  2017-10-19 14:53           ` Alan Somers
@ 2017-10-20 16:58             ` Yoann Moulin
  0 siblings, 0 replies; 9+ messages in thread
From: Yoann Moulin @ 2017-10-20 16:58 UTC (permalink / raw)
  To: Alan Somers, Willem Jan Withagen
  Cc: Alfredo Deza, Sage Weil, ceph-devel, John Spray

Hello,

>>> If these properties also mean device information (vendor, size,
>>> solid/rotational, etc...) it could help
>>> to better map/detect an OSD replacement since clusters tend to have a
>>> certain level of
>>> homogeneous hardware: if $brand, and $size, and $rotational etc...
>>>
>>>>
>>>> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
>>>> tool that is triggered by udev.  It would check for new, empty devices
>>>> appearing in the locations (as defined by the by-path string) previously
>>>> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
>>>> safe-to-destroy' to verify whether it is safe to automatically rebuild
>>>> that OSD.  (If not, it might want to raise a health alert, since it's
>>>> possible the drive that was physically pulled should be preserved until
>>>> the cluster is sure it doesn't need it.)
>>>
>>>
>>> systemd has some support for devices, so we might not even need a
>>> daemon, but more a unit that can
>>> depend on events already handled by systemd (would save us from udev).
>>
>>
>> FreeBSD does not have systemd. 8-)
>>
>> I'm inclined to say luckily, but then that may be my personal bias.
>> I don't like "automagic" tools like Udev or systemd tinkering with my disks.
>>
>> As Alan says, in ZFS one can designate hot-standby. But even there I prefer
>> to be alerted and then manually intervene.
> 
> Actually, I was talking about autoreplace by physical path.  Hot
> spares are something else.  The physical path of a drive is distinct
> from its device path.  The physical path is determined by information
> from a SES[1] expander, which can actually tell you which physical
> slots contain which logical drives.
> 
>>
>> A hot-swap daemon that gets instructed to only use explicitly and fully
>> enumerated disk might be something to trust. So something matching
>> disk-serial number would be oke.
> 
> Matching disk serial number isn't always safe in a VM.  VMs can
> generate duplicate serial numbers.  Better to match against a GPT
> label or something that identifies a drive as belonging to Ceph.
> That, unfortunately, requires some intervention from the
> administrator.  The nice thing about a user space daemon is that its
> behavior can easily be controlled by the sysadmin.  So for example, a
> sysadmin could opt into a rule that says "Ceph can take over all SCSI
> disks" or "Ceph can take over all disks without an existing partition
> table or known filesystem".

I like the idea of multiple levels of "take over".

To begin, we could imagine something like this:

There is a daemon that checks whether a new disk device has been added to an OSD server. A quick analysis of the disk tells whether there is a
partition table (and which type), whether there are partitions, and whether the content is zeros or garbage (like encrypted data), etc. The
daemon then sends a summary of the new device (see the rough sketch at the end of this message) to a mon, a mgr, or a new daemon.

If an OSD is marked down/out, give the possibility to replace the OSD marked down with a single command (same place in the CRUSH map, same weight):

  ceph osd <id> replaceby <new disk ref>

If no OSD is marked down, offer the ability to add a new OSD at the right place in the CRUSH map if possible (computing the placement of the
new disk may be difficult):

  ceph osd add <new disk ref> [crush map placement]

Or just ignore the disk:

  ceph osd ignore <new disk ref>

Then, under some conditions, give the cluster the ability to choose what to do without human interaction.
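
A rough sketch of gathering that device summary from sysfs (field
availability varies by transport, so missing entries are just left as
None):

  import os

  def device_summary(name):                  # name like 'sdb'
      base = os.path.join('/sys/block', name)

      def read(rel):
          try:
              with open(os.path.join(base, rel)) as f:
                  return f.read().strip()
          except IOError:
              return None

      sectors = read('size')                 # always in 512-byte sectors
      return {
          'device': '/dev/' + name,
          'size_bytes': int(sectors) * 512 if sectors else None,
          'rotational': read('queue/rotational'),
          'vendor': read('device/vendor'),
          'model': read('device/model'),
      }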

Does this make sense?


-- 
Yoann Moulin
EPFL IC-IT

