* Supplying ID to ceph-disk when creating OSD
@ 2017-02-15 16:59 Wido den Hollander
  2017-02-15 17:12 ` Loic Dachary
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Wido den Hollander @ 2017-02-15 16:59 UTC (permalink / raw)
  To: ceph-devel

Hi,

Currently we can supply an OSD UUID to 'ceph-disk prepare', but we can't provide an OSD ID.

With BlueStore coming I think the use-case for this is becoming very valid:

1. Stop OSD
2. Zap disk
3. Re-create OSD with same ID and UUID (with BlueStore)
4. Start OSD

This allows for an in-place update of the OSD without modifying the CRUSHMap. From the cluster's point of view the OSD goes down and comes back up empty.

There were some drawbacks and dangers around this, so before I start working on a PR: are there any gotchas which might be a problem?

The idea is that users have a very simple way to re-format an OSD in-place while keeping the same CRUSH location, ID and UUID.

Wido


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 16:59 Supplying ID to ceph-disk when creating OSD Wido den Hollander
@ 2017-02-15 17:12 ` Loic Dachary
  2017-02-15 17:14 ` Sage Weil
  2017-02-15 17:16 ` Gregory Farnum
  2 siblings, 0 replies; 12+ messages in thread
From: Loic Dachary @ 2017-02-15 17:12 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel

Hi Wido,

On 02/15/2017 05:59 PM, Wido den Hollander wrote:
> Hi,
> 
> Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't provide a OSD ID.
> 
> With BlueStore coming I think the use-case for this is becoming very valid:
> 
> 1. Stop OSD
> 2. Zap disk
> 3. Re-create OSD with same ID and UUID (with BlueStore)
> 4. Start OSD
> 
> This allows for a in-place update of the OSD without modifying the CRUSHMap. For the cluster's point of view the OSD goes down and comes back up empty.
> 
> There were some drawbacks around this and some dangers, so before I start working on a PR for this, any gotcaches which might be a problem?
> 
> The idea is that users have a very simple way to re-format a OSD in-place while keeping the same CRUSH location, ID and UUID.

Since the UUID <> id mapping is kept in the monitor, re-using the UUID should give you the same OSD id. That is, unless you remove the OSD from the monitor, but that's not what you're trying to do, right?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 16:59 Supplying ID to ceph-disk when creating OSD Wido den Hollander
  2017-02-15 17:12 ` Loic Dachary
@ 2017-02-15 17:14 ` Sage Weil
  2017-02-15 17:16   ` Sage Weil
  2017-02-16 15:14   ` Wido den Hollander
  2017-02-15 17:16 ` Gregory Farnum
  2 siblings, 2 replies; 12+ messages in thread
From: Sage Weil @ 2017-02-15 17:14 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Wed, 15 Feb 2017, Wido den Hollander wrote:
> Hi,
> 
> Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't 
> provide a OSD ID.
> 
> With BlueStore coming I think the use-case for this is becoming very 
> valid:
> 
> 1. Stop OSD
> 2. Zap disk
> 3. Re-create OSD with same ID and UUID (with BlueStore)
> 4. Start OSD
> 
> This allows for a in-place update of the OSD without modifying the 
> CRUSHMap. For the cluster's point of view the OSD goes down and comes 
> back up empty.
> 
> There were some drawbacks around this and some dangers, so before I 
> start working on a PR for this, any gotcaches which might be a problem?
> 
> The idea is that users have a very simple way to re-format a OSD 
> in-place while keeping the same CRUSH location, ID and UUID.

+1

However, I don't think we need to specify the osd id.. just the uuid.  If 
you pass an existing uuid to 'osd create' it will give you back the 
existing osd id.  Please test to confirm, but I *think* it is sufficient 
to just give ceph-disk prepare the old osd's uuid.
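
The lookup described above can be tried with a small wrapper. This is a minimal sketch, assuming 'ceph osd create <uuid>' prints the OSD id on stdout and that the caller's keyring may run it; `osd_id_for_uuid` is a hypothetical helper name, not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper: ask the monitors which OSD id belongs to a UUID.
# 'ceph osd create <uuid>' is expected to hand back the existing id when
# the UUID is already registered (the thread asks for this to be tested).
osd_id_for_uuid() {
    uuid=$1
    [ -n "$uuid" ] || { echo "usage: osd_id_for_uuid <uuid>" >&2; return 2; }
    ceph osd create "$uuid"
}
```

Note that for a UUID the monitors do not know yet, the same command allocates a new id, so this is only a safe lookup for a UUID that is already registered.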

Maybe the thing to do is create a streamlined command to do this: 
'ceph-disk prepare --zap-and-reformat' or something that grabs the old 
uuid for you, does the zap, and then feeds it to prepare?

sage



* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 17:14 ` Sage Weil
@ 2017-02-15 17:16   ` Sage Weil
  2017-02-16 15:14   ` Wido den Hollander
  1 sibling, 0 replies; 12+ messages in thread
From: Sage Weil @ 2017-02-15 17:16 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Wed, 15 Feb 2017, Sage Weil wrote:
> On Wed, 15 Feb 2017, Wido den Hollander wrote:
> > Hi,
> > 
> > Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't 
> > provide a OSD ID.
> > 
> > With BlueStore coming I think the use-case for this is becoming very 
> > valid:
> > 
> > 1. Stop OSD
> > 2. Zap disk
> > 3. Re-create OSD with same ID and UUID (with BlueStore)
> > 4. Start OSD
> > 
> > This allows for a in-place update of the OSD without modifying the 
> > CRUSHMap. For the cluster's point of view the OSD goes down and comes 
> > back up empty.
> > 
> > There were some drawbacks around this and some dangers, so before I 
> > start working on a PR for this, any gotcaches which might be a problem?
> > 
> > The idea is that users have a very simple way to re-format a OSD 
> > in-place while keeping the same CRUSH location, ID and UUID.
> 
> +1
> 
> However, I don't think we need to specify the osd id.. just the uuid.  If 
> you pass an existing uuid to 'osd create' it will give you back the 
> existing osd id.  Please test to confirm, but I *think* it is sufficient 
> to just give ceph-disk prepare the old osd's uuid.
> 
> Maybe the thing to do is create a streamlined command to do this: 
> 'ceph-disk prepare --zap-and-reformat' or something that grabs the old 
> uuid for you, does the zap, and then feeds it to prepare?

Hmm, alternatively, maybe we want something a bit more generic, like

 ceph-disk prepare --replace-osd-id NNN ...

and have that fetch the uuid to use from the monitor.  That way it's the 
same process if you yank out a failed disk, plug a new one in, and want to 
rebuild it in place.
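
Fetching the uuid from the monitor could look roughly like this. It is a sketch only: it assumes each "osd.N ..." line printed by 'ceph osd dump' ends with that OSD's uuid, and `osd_uuid_for_id` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the UUID the monitors hold for osd.<id>.
# Assumption: each "osd.N ..." line of 'ceph osd dump' ends with the UUID.
osd_uuid_for_id() {
    id=$1
    ceph osd dump | awk -v name="osd.$id" '$1 == name { print $NF }'
}
```

An empty result means the monitors have no entry for that id, in which case there is nothing to replace.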

sage


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 16:59 Supplying ID to ceph-disk when creating OSD Wido den Hollander
  2017-02-15 17:12 ` Loic Dachary
  2017-02-15 17:14 ` Sage Weil
@ 2017-02-15 17:16 ` Gregory Farnum
  2017-02-15 17:34   ` Sage Weil
  2 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-02-15 17:16 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Wed, Feb 15, 2017 at 8:59 AM, Wido den Hollander <wido@42on.com> wrote:
> Hi,
>
> Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't provide a OSD ID.
>
> With BlueStore coming I think the use-case for this is becoming very valid:
>
> 1. Stop OSD
> 2. Zap disk
> 3. Re-create OSD with same ID and UUID (with BlueStore)
> 4. Start OSD
>
> This allows for a in-place update of the OSD without modifying the CRUSHMap. For the cluster's point of view the OSD goes down and comes back up empty.
>
> There were some drawbacks around this and some dangers, so before I start working on a PR for this, any gotcaches which might be a problem?

Yes. Unfortunately they are subtle and I don't remember them. :p

I'd recommend going back and finding the historical discussions about
this to be sure. I *think* there were two main issues which prompted
us to remove that:
1) people creating very large IDs, needlessly exploding OSDMap size
because it's all array-based,
2) issues reusing the ID of lost OSDs versus PGs recognizing that the
OSD didn't have the data they wanted.

1 is still a bit of a problem, though if anybody has a good UX way of
handling it that's the real issue. 2 has hopefully been fixed over the
course of various refactors and improvements, but it's not something
I'd count on without checking very carefully.
-Greg

>
> The idea is that users have a very simple way to re-format a OSD in-place while keeping the same CRUSH location, ID and UUID.
>
> Wido
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 17:16 ` Gregory Farnum
@ 2017-02-15 17:34   ` Sage Weil
  2017-02-15 18:13     ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-02-15 17:34 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Wido den Hollander, ceph-devel

On Wed, 15 Feb 2017, Gregory Farnum wrote:
> On Wed, Feb 15, 2017 at 8:59 AM, Wido den Hollander <wido@42on.com> wrote:
> > Hi,
> >
> > Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't provide a OSD ID.
> >
> > With BlueStore coming I think the use-case for this is becoming very valid:
> >
> > 1. Stop OSD
> > 2. Zap disk
> > 3. Re-create OSD with same ID and UUID (with BlueStore)
> > 4. Start OSD
> >
> > This allows for a in-place update of the OSD without modifying the CRUSHMap. For the cluster's point of view the OSD goes down and comes back up empty.
> >
> > There were some drawbacks around this and some dangers, so before I start working on a PR for this, any gotcaches which might be a problem?
> 
> Yes. Unfortunately they are subtle and I don't remember them. :p
> 
> I'd recommend going back and finding the historical discussions about
> this to be sure. I *think* there were two main issues which prompted
> us to remove that:
> 1) people creating very large IDs, needlessly exploding OSDMap size
> because it's all array-based,

Working in terms of uuids should avoid this (i.e., users can't 
force a large osd id without significant effort).

> 2) issues reusing the ID of lost OSDs versus PGs recognizing that the
> OSD didn't have the data they wanted.
> 
> 1 is still a bit of a problem, though if anybody has a good UX way of
> handling it that's the real issue. 2 has hopefully been fixed over the
> course of various refactors and improvements, but it's not something
> I'd count on without checking very carefully.

Oh yeah, this is the one that worries me.  I think the scenario we want 
to help users avoid is that osd N exists (but might be down at the moment) 
and a new, empty version of that same OSD is created and started.  
Peering will reasonably conclude that PG instances don't exist and may 
end up concluding that writes didn't happen.

I think we want some sort of safety check so that the user has to say "this 
osd is dead" before they're allowed to create a new one in its image.  I 
think the simplest thing is to use the existing 'ceph osd lost ...' 
command for this.  I.e., the mon won't let a blank OSD start with a given 
uuid/id unless it is either a new osd rank or the rank is marked 
lost.
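
The rule in the paragraph above reduces to a two-input decision. The sketch below only models that decision; the function name and its yes/no arguments are hypothetical, and the real check would live in the monitor.

```shell
#!/bin/sh
# Hypothetical model of the proposed gate. Arguments:
#   $1 = does this osd rank already exist in the map? (yes/no)
#   $2 = is the rank marked lost?                      (yes/no)
may_start_blank_osd() {
    exists=$1
    lost=$2
    if [ "$exists" = no ]; then
        echo allow    # brand-new rank, nothing to protect
    elif [ "$lost" = yes ]; then
        echo allow    # the operator has declared the old OSD dead
    else
        echo deny     # old OSD may still hold data that peering relies on
    fi
}
```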

My main lingering doubt here is whether it's a bad idea to reuse a uuid; 
it seems like the whole point is that uuids are unique.  Perhaps instead 
the ceph-disk prepare --replace-osd-id NN command should replace the old uuid 
in the map with the new one as part of this process.  Probably something 
like 'ceph osd replace newuuid olduuid' to make the whole thing 
idempotent...

sage



> -Greg
> 
> >
> > The idea is that users have a very simple way to re-format a OSD in-place while keeping the same CRUSH location, ID and UUID.
> >
> > Wido


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 17:34   ` Sage Weil
@ 2017-02-15 18:13     ` Gregory Farnum
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2017-02-15 18:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wido den Hollander, ceph-devel

On Wed, Feb 15, 2017 at 9:34 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 15 Feb 2017, Gregory Farnum wrote:
>> On Wed, Feb 15, 2017 at 8:59 AM, Wido den Hollander <wido@42on.com> wrote:
>> > Hi,
>> >
>> > Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't provide a OSD ID.
>> >
>> > With BlueStore coming I think the use-case for this is becoming very valid:
>> >
>> > 1. Stop OSD
>> > 2. Zap disk
>> > 3. Re-create OSD with same ID and UUID (with BlueStore)
>> > 4. Start OSD
>> >
>> > This allows for a in-place update of the OSD without modifying the CRUSHMap. For the cluster's point of view the OSD goes down and comes back up empty.
>> >
>> > There were some drawbacks around this and some dangers, so before I start working on a PR for this, any gotcaches which might be a problem?
>>
>> Yes. Unfortunately they are subtle and I don't remember them. :p
>>
>> I'd recommend going back and finding the historical discussions about
>> this to be sure. I *think* there were two main issues which prompted
>> us to remove that:
>> 1) people creating very large IDs, needlessly exploding OSDMap size
>> because it's all array-based,
>
> Working in terms of uuids should avoid this (i.e., users can't
> force a large oid id without significant effort).
>
>> 2) issues reusing the ID of lost OSDs versus PGs recognizing that the
>> OSD didn't have the data they wanted.
>>
>> 1 is still a bit of a problem, though if anybody has a good UX way of
>> handling it that's the real issue. 2 has hopefully been fixed over the
>> course of various refactors and improvements, but it's not something
>> I'd count on without checking very carefully.
>
> Oh yeah, this is the one that worries me.  I think the scenario we want
> to help users avoid is that osd N exists (but might be down at the moment)
> and a new, empty version of that same OSD is created and started.
> Peering will reasonably conclude that PG instances don't exist and may
> end up concluding that writes didn't happen.
>
> I think we want some sort of safety check so that the user as to say "this
> osd is dead" before they're allowed to create a new one in its image.  I
> think the simplest thing is to use the existing 'ceph osd lost ...'
> command for this.  I.e., the mon won't let a blank OSD start with a given
> uuid/id unless it is either a new osd rank or the rank is marked
> lost.

Certainly that's a good target and using "ceph osd lost" is *supposed*
to handle this case. But I remember seeing reports that even after
running "lost" — certainly after and I think before they reused the ID
— the OSDs were still waiting for data from the new incarnation. It
worried me and some of them were addressed but I don't know if the
issues have been resolved as I think it's an area of light testing.

>
> My main lingering doubt here is whether it's a bad idea to reuse a uuid;
> it seems like the whole point is that uuids are unique.  Perhaps instead
> the ceph-disk prepare --replace-oid NN command should replace the old uuid
> in the map with the new one as part of this process.  Probably something
> like 'ceph osd replace newuuid olduuid' to make the whole thing
> idempotent...
>
> sage


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-15 17:14 ` Sage Weil
  2017-02-15 17:16   ` Sage Weil
@ 2017-02-16 15:14   ` Wido den Hollander
  2017-02-16 15:32     ` Sage Weil
  1 sibling, 1 reply; 12+ messages in thread
From: Wido den Hollander @ 2017-02-16 15:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


> Op 15 februari 2017 om 18:14 schreef Sage Weil <sage@newdream.net>:
> 
> 
> On Wed, 15 Feb 2017, Wido den Hollander wrote:
> > Hi,
> > 
> > Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't 
> > provide a OSD ID.
> > 
> > With BlueStore coming I think the use-case for this is becoming very 
> > valid:
> > 
> > 1. Stop OSD
> > 2. Zap disk
> > 3. Re-create OSD with same ID and UUID (with BlueStore)
> > 4. Start OSD
> > 
> > This allows for a in-place update of the OSD without modifying the 
> > CRUSHMap. For the cluster's point of view the OSD goes down and comes 
> > back up empty.
> > 
> > There were some drawbacks around this and some dangers, so before I 
> > start working on a PR for this, any gotcaches which might be a problem?
> > 
> > The idea is that users have a very simple way to re-format a OSD 
> > in-place while keeping the same CRUSH location, ID and UUID.
> 
> +1
> 
> However, I don't think we need to specify the osd id.. just the uuid.  If 
> you pass an existing uuid to 'osd create' it will give you back the 
> existing osd id.  Please test to confirm, but I *think* it is sufficient 
> to just give ceph-disk prepare the old osd's uuid.
> 

Ok, so there were a few things going on here:

- My memory which told me it wasn't possible
- Old Journal data
- Cephx issues

What it boils down to is that this is not sufficient:

$ systemctl stop ceph-osd@4
$ cat /var/lib/ceph/osd/ceph-4/fsid
$ umount /var/lib/ceph/osd/ceph-4
$ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 /dev/sdb

What needs to be done:

$ systemctl stop ceph-osd@4
$ ceph auth del osd.4
$ cat /var/lib/ceph/osd/ceph-4/fsid
$ umount /var/lib/ceph/osd/ceph-4
$ dd if=/dev/zero of=/dev/sdb2 bs=1M count=100
$ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 /dev/sdb

Zapping a disk only removes the GPT structures; the new XFS filesystem (in FileStore's case) then overwrites the previous one.

However, if the partition layout is the same as before, the journal is not emptied and the OSD will crash during start.

If you go to BlueStore this is not a problem since you overwrite the whole disk:

$ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 --bluestore /dev/sdb

Also, the old Cephx key is not re-used; a new one is registered, so you have to remove the old one first.
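
The FileStore procedure above can be collected into one function. This is a sketch under assumptions: the OSD is mounted at /var/lib/ceph/osd/ceph-<id> (the path the fsid is read from above), the journal partition is passed in by the caller, and `run` and `recreate_filestore_osd` are hypothetical names. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Args: OSD id, data disk, journal partition, optional UUID.
recreate_filestore_osd() {
    id=$1 disk=$2 journal=$3 uuid=$4
    mnt=/var/lib/ceph/osd/ceph-$id      # assumed mount point
    run systemctl stop "ceph-osd@$id" || return 1
    # Read the UUID before unmounting unless the caller supplied one.
    [ -n "$uuid" ] || uuid=$(cat "$mnt/fsid") || return 1
    run ceph auth del "osd.$id" || return 1
    run umount "$mnt" || return 1
    # Wipe the start of the old journal so stale data is not replayed.
    run dd if=/dev/zero of="$journal" bs=1M count=100 || return 1
    run ceph-disk prepare --zap-disk --osd-uuid "$uuid" "$disk"
}
```

For example, `DRY_RUN=1 recreate_filestore_osd 4 /dev/sdb /dev/sdb2 8f3b58f4-ded3-4b50-836e-72745405f482` prints the five commands in order without touching the disk.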

> Maybe the thing to do is create a streamlined command to do this: 
> 'ceph-disk prepare --zap-and-reformat' or something that grabs the old 
> uuid for you, does the zap, and then feeds it to prepare?

Probably a good idea, we just need to figure out how to remove the old key. The bootstrap key isn't allowed to do that:

root@echo:~# ceph --id bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth list
Error EACCES: access denied
root@echo:~#

The steps it should take:

1. Get OSD UUID
2. Try to unmount the disk (fails if OSD is still running)
3. Remove old Cephx key (how to do so?)
4. Zap the disk
5. Prepare disk with same UUID
6. Add new cephx key
7. Start the OSD

I am not sure how to do step #3 from a client with the bootstrap-osd keyring, though.

Wido

> 
> sage
>


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-16 15:14   ` Wido den Hollander
@ 2017-02-16 15:32     ` Sage Weil
  2017-02-16 16:20       ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-02-16 15:32 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Thu, 16 Feb 2017, Wido den Hollander wrote:
> > Op 15 februari 2017 om 18:14 schreef Sage Weil <sage@newdream.net>:
> > 
> > 
> > On Wed, 15 Feb 2017, Wido den Hollander wrote:
> > > Hi,
> > > 
> > > Currently we can supply a OSD UUID to 'ceph-disk prepare', but we can't 
> > > provide a OSD ID.
> > > 
> > > With BlueStore coming I think the use-case for this is becoming very 
> > > valid:
> > > 
> > > 1. Stop OSD
> > > 2. Zap disk
> > > 3. Re-create OSD with same ID and UUID (with BlueStore)
> > > 4. Start OSD
> > > 
> > > This allows for a in-place update of the OSD without modifying the 
> > > CRUSHMap. For the cluster's point of view the OSD goes down and comes 
> > > back up empty.
> > > 
> > > There were some drawbacks around this and some dangers, so before I 
> > > start working on a PR for this, any gotcaches which might be a problem?
> > > 
> > > The idea is that users have a very simple way to re-format a OSD 
> > > in-place while keeping the same CRUSH location, ID and UUID.
> > 
> > +1
> > 
> > However, I don't think we need to specify the osd id.. just the uuid.  If 
> > you pass an existing uuid to 'osd create' it will give you back the 
> > existing osd id.  Please test to confirm, but I *think* it is sufficient 
> > to just give ceph-disk prepare the old osd's uuid.
> > 
> 
> Ok, so there were a few things going on here:
> 
> - My memory which told me it wasn't possible
> - Old Journal data
> - Cephx issues
> 
> What it boils down to is that this is not sufficient:
> 
> $ systemctl stop ceph-osd@4
> $ cat /var/lib/ceph/osd/ceph-4/fsid
> $ umount
> $ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 /dev/sdb
> 
> What needs to be done:
> 
> $ systemctl stop ceph-osd@4
> $ ceph auth del osd.4
> $ cat /var/lib/ceph/osd/ceph-4/fsid
> $ umount
> $ dd if=/dev/zero of=/dev/sdb2 bs=1M count=100
> $ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 /dev/sdb
> 
> Zapping a disk only removes the GPT structures and your XFS (filestore's 
> case) will overwrite the previous system.
> 
> However, if the partition layout is the same as before the Journal will 
> not be emptied and the OSD will crash during start.

This feels like a bug in 'zap'.  Let's make it zero the first 1M of old 
partitions before blowing away the GPT table?
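
A rough sketch of that zap fix, assuming GNU dd (for `bs=1M`) and that the caller passes the old partitions explicitly; `zero_partition_heads` is a hypothetical name and this is not the code that went into ceph-disk.

```shell
#!/bin/sh
# Zero the first 1 MiB of each given partition so stale journal data
# cannot be picked up again after the GPT is recreated.
# conv=notrunc keeps regular files (handy for testing) at their original
# size and makes no difference on block devices.
zero_partition_heads() {
    for part in "$@"; do
        dd if=/dev/zero of="$part" bs=1M count=1 conv=notrunc 2>/dev/null || return 1
    done
}
```

It would be called with the old partitions, e.g. `zero_partition_heads /dev/sdb1 /dev/sdb2`, before the partition table is destroyed.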

> If you go to BlueStore this is not a problem since you overwrite the 
> whole disk:
> 
> $ ceph-disk prepare --zap-disk --osd-uuid 8f3b58f4-ded3-4b50-836e-72745405f482 --bluestore /dev/sdb
> 
> Also, the old Cephx is not re-used but a new one is registered, so you 
> have to remove the old one first.
> 
> > Maybe the thing to do is create a streamlined command to do this: 
> > 'ceph-disk prepare --zap-and-reformat' or something that grabs the old 
> > uuid for you, does the zap, and then feeds it to prepare?
> 
> Probably a good idea, we just need to figure out how to remove the old 
> key. The bootstrap key isn't allowed to do that:
> 
> root@echo:~# ceph --id bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth list
> Error EACCES: access denied
> root@echo:~#
> 
> The steps it should take:
> 
> 1. Get OSD UUID
> 2. Try to unmount the disk (fails if OSD is still running)
> 3. Remove old Cephx key (how to do so?)
> 4. Zap the disk
> 5. Prepare disk with same UUID
> 6. Add new cephx key
> 7. Start the OSD
> 
> I am not sure on how to do step #3 from a client with the bootstrap-osd 
> keyring though.

There is another step here (see my other email) where we should mark the 
osd as lost before we allow it to be replaced.  So,

0. 'ceph osd lost NNN' from a client.admin node

Assuming that is done, then I think the rest of the procedure would be

2. Try to unmount the disk (fails if OSD is still running)
3. zap the disk
4. ceph-disk prepare --replace-osd-id NNN

This would do 'ceph osd replace <osd-id> <new-uuid>' instead of 'ceph 
osd create <uuid>'.  The new mon command would verify that (1) the osd is 
marked as lost (a safety check that makes bootstrap's ability to do this 
reasonably secure) and (2) change the uuid to new-uuid.  It could also (3) 
remove old cephx keys.  Note that we had a thread about making create do 
this a month or two ago; this might be a good time to fix that too.  The 
idea was that the bootstrap permissions are super wonky because they have 
to allow creating new cephx keys and so on.  Instead, we should make a 
single command that does everything (including creating the cephx keys) 
and returns the whole result to ceph-disk in a blob of json (new osd id + 
cephx key).  The replace command could work the same way (including the 
step of removing the old key), and then the allowed commands for 
the bootstrap key would be just 'osd create' and 'osd replace', period.

5. Start the OSD
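
Written out as commands, the proposed sequence might look like the sketch below. It only prints a plan and executes nothing. Assumptions: the OSD is mounted at /var/lib/ceph/osd/ceph-<id>, 'ceph osd lost' is given its confirmation flag, and '--replace-osd-id' is the option proposed in this thread rather than an existing ceph-disk option; `print_replace_plan` is a hypothetical name.

```shell
#!/bin/sh
# Prints the sequence proposed in this message; nothing is executed.
# NOTE: '--replace-osd-id' is only a proposal in this thread.
print_replace_plan() {
    id=$1 disk=$2
    cat <<EOF
ceph osd lost $id --yes-i-really-mean-it
umount /var/lib/ceph/osd/ceph-$id
ceph-disk zap $disk
ceph-disk prepare --replace-osd-id $id $disk
systemctl start ceph-osd@$id
EOF
}
```

Only the first command needs admin credentials; the rest is what the bootstrap-osd key would be trusted with under this proposal.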

What do you think?
sage


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-16 15:32     ` Sage Weil
@ 2017-02-16 16:20       ` Sage Weil
  2017-02-16 21:56         ` Wido den Hollander
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-02-16 16:20 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Thu, 16 Feb 2017, Sage Weil wrote:
> There is another step here (see my other email) where we should mark the 
> osd as lost before we allow it to be replaced.  So,
> 
> 0. 'ceph osd lost NNN' from a client.admin node
> 
> Assuming that is done, then I think the rest of the procedure would be
> 
> 2. Try to unmount the disk (fails if OSD is still running)
> 3. zap the disk
> 4. ceph-disk prepare --replace-osd-id NNN
> 
> This would do 'ceph osd replace <osd-id> <new-uuid>' instead of 'ceph 
> osd create <uuid>'.  The new mon command would verify that (1) the osd is 
> marked as lost (safety check that makes bootstraps ability to do this 
> reasonably secure) and (2) change the uuid to new-uuid.  It could also (3) 
> remove old cephx keys.  Note that we had a thread about making create do 
> this a month or two ago; this might be a good time to fix that too.  The 
> idea was that the boostrap permissions are super wonky because they have 
> to allow creating new cephx keys and so on.  Instead, we should make a 
> single command that does everything (including creating the cephx keys) 
> and returns the whole result to ceph-disk in a blob of json (new osd id + 
> cephx key).  The replace command could work the same way (including the 
> step of removing the old key), and then the allowed commands for 
> the bootstrap key would be just 'osd create' and 'osd replace', period.

Okay, I found the other thread:

	http://marc.info/?t=147913846400007&r=1&w=2

which is about the additional work of setting up the lockbox keys for 
dm-crypt as part of osd creation/bootstrap.  I think the way to do this 
properly is to just bite the bullet and make a new command, let's call it 
'osd bootstrap', and have it take all the various [optional] arguments for 
setting up a new osd, including the osd we want to replace (if any).  It 
can do all the right things as far as setting up (or replacing) cephx 
keys, and be a single atomic and idempotent operation so that ceph-disk is 
super simple and the osd-bootstrap key can actually be secure.

This has totally blown up in scope...are you up for it?  In the meantime, 
the zap fix is small and simple and unrelated to the rest.

sage


> 
> 5. Start the OSD
> 
> What do you think?
> sage



* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-16 16:20       ` Sage Weil
@ 2017-02-16 21:56         ` Wido den Hollander
  2017-03-06 21:52           ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Wido den Hollander @ 2017-02-16 21:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


> Op 16 februari 2017 om 17:20 schreef Sage Weil <sage@newdream.net>:
> 
> 
> On Thu, 16 Feb 2017, Sage Weil wrote:
> > There is another step here (see my other email) where we should mark the 
> > osd as lost before we allow it to be replaced.  So,
> > 
> > 0. 'ceph osd lost NNN' from a client.admin node
> > 
> > Assuming that is done, then I think the rest of the procedure would be
> > 
> > 2. Try to unmount the disk (fails if OSD is still running)
> > 3. zap the disk
> > 4. ceph-disk prepare --replace-osd-id NNN
> > 
> > This would do 'ceph osd replace <osd-id> <new-uuid>' instead of 'ceph 
> > osd create <uuid>'.  The new mon command would verify that (1) the osd is 
> > marked as lost (safety check that makes bootstraps ability to do this 
> > reasonably secure) and (2) change the uuid to new-uuid.  It could also (3) 
> > remove old cephx keys.  Note that we had a thread about making create do 
> > this a month or two ago; this might be a good time to fix that too.  The 
> > idea was that the boostrap permissions are super wonky because they have 
> > to allow creating new cephx keys and so on.  Instead, we should make a 
> > single command that does everything (including creating the cephx keys) 
> > and returns the whole result to ceph-disk in a blob of json (new osd id + 
> > cephx key).  The replace command could work the same way (including the 
> > step of removing the old key), and then the allowed commands for 
> > the bootstrap key would be just 'osd create' and 'osd replace', period.
> 
> Okay, I found the other thread:
> 
> 	http://marc.info/?t=147913846400007&r=1&w=2
> 
> which is about the additional work of setting up the lockbox keys for 
> dm-crypt as part of osd creation/bootstrap.  I think the way to do this 
> properly is to just bite the bullet and make a new command, let's call it 
> 'osd bootstrap', and have it take all the various [optional] arguments for 
> setting up a new osd, including the osd we want to replace (if any).  It 
> can do all the right things as far as setting up (or replacing) cephx 
> keys, and be a single atomic and idempotent operation so that ceph-disk is 
> super simple and the osd-bootstrap key can actually be secure.
> 
> This has totally blown up in scope...are you up for it?  In the meantime, 
> the zap fix is small and simple and unrelated to the rest.
> 

Well, yes, that's a bit more work :) I've never worked on that part of the code in the MONs, so that will be new for me. Could be a good thing to try and learn from.

The zap-disk thing is indeed low-hanging fruit; I created an issue for it: http://tracker.ceph.com/issues/18962

PR should be there shortly.

Wido

> sage
> 
> 
> > 
> > 5. Start the OSD
> > 
> > What do you think?
> > sage


* Re: Supplying ID to ceph-disk when creating OSD
  2017-02-16 21:56         ` Wido den Hollander
@ 2017-03-06 21:52           ` Sage Weil
  0 siblings, 0 replies; 12+ messages in thread
From: Sage Weil @ 2017-03-06 21:52 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Thu, 16 Feb 2017, Wido den Hollander wrote:
> > Op 16 februari 2017 om 17:20 schreef Sage Weil <sage@newdream.net>:
> > 
> > On Thu, 16 Feb 2017, Sage Weil wrote:
> > > There is another step here (see my other email) where we should mark the 
> > > osd as lost before we allow it to be replaced.  So,
> > > 
> > > 0. 'ceph osd lost NNN' from a client.admin node
> > > 
> > > Assuming that is done, then I think the rest of the procedure would be
> > > 
> > > 2. Try to unmount the disk (fails if OSD is still running)
> > > 3. zap the disk
> > > 4. ceph-disk prepare --replace-osd-id NNN
> > > 
> > > This would do 'ceph osd replace <osd-id> <new-uuid>' instead of 'ceph 
> > > osd create <uuid>'.  The new mon command would verify that (1) the osd is 
> > > marked as lost (safety check that makes bootstraps ability to do this 
> > > reasonably secure) and (2) change the uuid to new-uuid.  It could also (3) 
> > > remove old cephx keys.  Note that we had a thread about making create do 
> > > this a month or two ago; this might be a good time to fix that too.  The 
> > > idea was that the boostrap permissions are super wonky because they have 
> > > to allow creating new cephx keys and so on.  Instead, we should make a 
> > > single command that does everything (including creating the cephx keys) 
> > > and returns the whole result to ceph-disk in a blob of json (new osd id + 
> > > cephx key).  The replace command could work the same way (including the 
> > > step of removing the old key), and then the allowed commands for 
> > > the bootstrap key would be just 'osd create' and 'osd replace', period.
> > 
> > Okay, I found the other thread:
> > 
> > 	http://marc.info/?t=147913846400007&r=1&w=2
> > 
> > which is about the additional work of setting up the lockbox keys for 
> > dm-crypt as part of osd creation/bootstrap.  I think the way to do this 
> > properly is to just bite the bullet and make a new command, let's call it 
> > 'osd bootstrap', and have it take all the various [optional] arguments for 
> > setting up a new osd, including the osd we want to replace (if any).  It 
> > can do all the right things as far as setting up (or replacing) cephx 
> > keys, and be a single atomic and idempotent operation so that ceph-disk is 
> > super simple and the osd-bootstrap key can actually be secure.
> > 
> > This has totally blown up in scope...are you up for it?  In the meantime, 
> > the zap fix is small and simple and unrelated to the rest.
> > 
> 
> Well, yes, that's a bit more work :) I've never worked on that part of 
> the code in the MONs, so that will be new for me. Could be a good thing 
> to try and learn from.

I wrote up some notes on an etherpad:

	http://pad.ceph.com/p/osd-replacement

A few changes:

- I think 'ceph osd rm' would make more sense than 'ceph osd lost' for the 
safety gate.
- The 'osd create' replacement command has some notes.  It includes the 
ability to set some arbitrary config-key values so that it can be used for 
the dmcrypt lockbox (or hopefully future things too).

sage

