* crush_location hook vs calamari
       [not found] ` <CAGd4Wr2+AD2J690cdcSwknLhUX16SR+WRQDVpn=+UB9mnO0-Yg@mail.gmail.com>
@ 2015-01-16 17:39   ` Sage Weil
       [not found]     ` <CAOWd=nCSAqmSok7+sXm8VqZDUWbdsBDiLovzcDVUXk3018pQ2w@mail.gmail.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-01-16 17:39 UTC (permalink / raw)
  To: John Spray; +Cc: Gregory Meno, Dan Mick, John Spray, ceph-devel, ceph-calamari

[adding ceph-devel, ceph-calamari]

On Fri, 16 Jan 2015, John Spray wrote:
> Ideally we would have a solution that preserved the OSD hot-plugging
> ability (it's a neat feature).
> 
> Perhaps the crush location logic should be:
> * If nobody ever overrode me, default behaviour
> * If someone (calamari) set an explicit location, preserve that
> * UNLESS I am on a different hostname than I was when the explicit
> location was set, in which case kick in the hotplug behaviour

This would be nice...

> The hotplug path might just be to reset my location in the existing
> way, or if calamari was really clever it could define how to handle a
> hostname change within a different root (typically the 'ssd' root
> people create) such that if I unplugged ssd_root->myhost_ssd and
> plugged it into foohost, then it would reset its crush location to
> ssd_root->foohost_ssd instead of root->foohost.
> 
> We might want to consider adding a flag into the crush map itself so
> that nodes can be "locked" to indicate that their location was set by
> human intent rather than the crush-location script.

Perhaps a per-osd flag in the OSDMap?  We have a field for this right now, 
although none of the fields are user-modifiable (they are things like 'up' 
and 'exists').  I think that makes the most sense.

We may also be able to avoid the pain in some cases if we bite the bullet 
and standardize how to handle parallel hdd vs ssd vs whatever trees.  Two 
approaches come to mind:

1) Make a tree like

 root ssd
     host host1:ssd
         osd.0
         osd.1
     host host2:ssd
         osd.2
         osd.3
 root sata
     host host1:sata
         osd.4
         osd.5
     host host2:sata
         osd.6
         osd.7

where we 'standardize' (by convention) on ':' as a separator between name and 
device type.  Then we could modify the crush location process to treat a 
'host=host1' location and a current host of host1:ssd as a match and make no 
change.
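
A rough sketch of that comparison (shell, with made-up variable names; not 
what the current code does):

  # illustrative only: treat host1 and host1:ssd as the same host by
  # stripping any ':<devicetype>' suffix before comparing
  desired_host="host1"          # from the crush location hook (host=...)
  current_host="host1:ssd"      # bucket the OSD currently sits under
  if [ "${current_host%%:*}" = "$desired_host" ]; then
      echo "host matches; leave crush location alone"
  fi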

2) Make the per-type tree generation programmatic.  So you would build a 
single tree like this:

 root default
     host host1
         devicetype ssd
             osd.0
             osd.1
         devicetype hdd
             osd.4
             osd.5
     host host2
         devicetype ssd
             osd.2
             osd.3
         devicetype hdd
             osd.6
             osd.7

and then on any map change a function in the monitor would programmatically 
create a set of per-type trees in the same map:

 root default
     host host1
         devicetype ssd
             osd.0
             osd.1
         devicetype hdd
             osd.4
             osd.5
     host host2
         devicetype ssd
             osd.2
             osd.3
         devicetype hdd
             osd.6
             osd.7
 root default-devicetype:ssd
     host host1-devicetype:ssd
         osd.0
         osd.1
     host host2-devicetype:ssd
         osd.2
         osd.3
 root default-devicetype:hdd
     host host1-devicetype:hdd
         osd.4
         osd.5
     host host2-devicetype:hdd
         osd.6
         osd.7

The nice thing about this is the crush location script goes on specifying 
the same thing it does now, like host=host1 rack=rack1 etc.  The only 
thing we add is a devicetype=ssd or hdd, perhaps based on what we glean 
from /sys/block/* (e.g., there is a 'rotational' flag in there to help 
identify SSDs).  Rules that use 'default' will see no change.  But if this 
feature is enabled and we start generating trees based on the 'devicetype' 
crush type we'll get a new set of automagic roots that rules can use 
instead.
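
As a rough sketch (shell; assumes the standard sysfs layout and that we 
already know which block device backs the OSD):

  # illustrative hook fragment: classify the backing device via sysfs
  dev=sda                                   # device backing this OSD's data
  if [ "$(cat /sys/block/$dev/queue/rotational)" = "0" ]; then
      devicetype=ssd
  else
      devicetype=hdd
  fi
  echo "host=$(hostname -s) devicetype=$devicetype"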

This doesn't really address the Calamari problem, though... but it would 
solve one of the main use-cases for customizing the map, I think?

sage







> 
> John
> 
> On Fri, Jan 16, 2015 at 2:14 PM, Gregory Meno <gmeno@redhat.com> wrote:
> > The problem I am trying to solve is:
> > Calamari now has the ability to manage the crush map, and for that to be
> > useful I need to prevent the default behavior where OSDs update their
> > location on start.
> >
> > The config surrounding crush_location seems complicated enough that I want
> > some help deciding on the best approach.
> >
> > http://tracker.ceph.com/issues/8667 contains the background info
> >
> > options:
> > - Calamari sets "osd update on start" to false on all OSDs it manages.
> >
> > - Calamari sets "osd crush location hook" on all OSDs it manages
> >
> > criteria:
> >
> > - don't piss off admins with existing clusters and configs
> >
> > - the solution still applies when the cluster life-cycle requires adding new OSDs
> >
> > - ??? am I missing more
> >
> > comparison:
> >  TBD
> >
> > recommendation:
> >
> > after talking to Dan the solution that seems best is:
> >
> > Have calamari set "osd crush location hook" to a script that asks either
> > calamari or the cluster for the OSD's last known location in the CRUSH map;
> > if this is a new OSD, fall back to a sensible default, e.g. the behavior as
> > if "osd update on start" were true.
> >
> > The thing I like most about this approach is that we edit the config file
> > one time.
> >
> > regards,
> > Gregory
> 
> 


* Fwd: crush_location hook vs calamari
       [not found]     ` <CAOWd=nCSAqmSok7+sXm8VqZDUWbdsBDiLovzcDVUXk3018pQ2w@mail.gmail.com>
@ 2015-01-19 16:12       ` Gregory Meno
  2015-01-19 16:18         ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Meno @ 2015-01-19 16:12 UTC (permalink / raw)
  To: ceph-devel

On Fri, Jan 16, 2015 at 12:39 PM, Sage Weil <sweil@redhat.com> wrote:
>
> [adding ceph-devel, ceph-calamari]
>
> On Fri, 16 Jan 2015, John Spray wrote:
> > Ideally we would have a solution that preserved the OSD hot-plugging
> > ability (it's a neat feature).
> >
> > Perhaps the crush location logic should be:
> > * If nobody ever overrode me, default behaviour
> > * If someone (calamari) set an explicit location, preserve that
> > * UNLESS I am on a different hostname than I was when the explicit
> > location was set, in which case kick in the hotplug behaviour
>
> This would be nice...


I agree this sounds fine, and easy enough to explain.

>
>
> > The hotplug path might just be to reset my location in the existing
> > way, or if calamari was really clever it could define how to handle a
> > hostname change within a different root (typically the 'ssd' root
> > people create) such that if I unplugged ssd_root->myhost_ssd and
> > plugged it into foohost, then it would reset its crush location to
> > ssd_root->foohost_ssd instead of root->foohost.
> >
> > We might want to consider adding a flag into the crush map itself so
> > that nodes can be "locked" to indicate that their location was set by
> > human intent rather than the crush-location script.
>
> Perhaps a per-osd flag in the OSDMap?  We have a field for this right now,
> although none of the fields are user-modifiable (they are things like 'up'
> and 'exists').  I think that makes the most sense.


So if I understand this correctly, we are talking about adding data to
the CRUSH map for the crush-location script to read.

It does not appear to talk to the cluster at present:
ubuntu@vpm148:~$ strace ceph-crush-location --cluster ceph --id 0
--type osd 2>&1 | grep ^open
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/bin/ceph-crush-location", O_RDONLY) = 3

>
>
> We may also be able to avoid the pain in some cases if we bite the bullet
> and standardize how to handle parallel hdd vs ssd vs whatever trees.  Two
> approaches come to mind:


I have always thought it strange to have multiple trees that duplicate
things with a single physical presence, e.g. foohost-spinning-disk and
foohost-SSD. It seems to me that if we are going to change things we should
be tagging the OSDs with SSD, but I don't think we should restrict the
tagging to them. It would be a good option for the future if each node could
be tagged in some way; there are other things on a host that you might care
to tag, like networking capability (10Gbit vs 1Gbit links).

This suggestion would introduce a need to migrate existing crush maps
and rules, perhaps more than we want. Maybe we take steps in this
direction now so we can get there by the Ceph K series.

>
>
> 1) Make a tree like
>
>  root ssd
>      host host1:ssd
>          osd.0
>          osd.1
>      host host2:ssd
>          osd.2
>          osd.3
>  root sata
>      host host1:sata
>          osd.4
>          osd.5
>      host host2:sata
>          osd.6
>          osd.7
>
> where we 'standardize' (by convention) : as a separator between name and
> device type.  Then we could modify the crush location process to take a
> 'host=host1' location and current host of host1:ssd as a match and make no
> change.
>
> 2) Make the per-type tree generation programmatic.  So you would build a
> single tree like this:
>
>  root default
>      host host1
>          devicetype ssd
>              osd.0
>              osd.1
>          devicetype hdd
>              osd.4
>              osd.5
>      host host2
>          devicetype ssd
>              osd.2
>              osd.3
>          devicetype hdd
>              osd.6
>              osd.7
>
> and then on any map change a function in the monitor would programmatically
> create a set of per-type trees in the same map:
>
>  root default
>      host host1
>          devicetype ssd
>              osd.0
>              osd.1
>          devicetype hdd
>              osd.4
>              osd.5
>      host host2
>          devicetype ssd
>              osd.2
>              osd.3
>          devicetype hdd
>              osd.6
>              osd.7
>  root default-devicetype:ssd
>      host host1-devicetype:ssd
>          osd.0
>          osd.1
>      host host2-devicetype:ssd
>          osd.2
>          osd.3
>  root default-devicetype:hdd
>      host host1-devicetype:hdd
>          osd.4
>          osd.5
>      host host2-devicetype:hdd
>          osd.6
>          osd.7
>
> The nice thing about this is the crush location script goes on specifying
> the same thing it does now, like host=host1 rack=rack1 etc.  The only
> thing we add is a devicetype=ssd or hdd, perhaps based on what we glean
> from /sys/block/* (e.g., there is a 'rotational' flag in there to help
> identify SSDs).  Rules that use 'default' will see no change.  But if this
> feature is enabled and we start generating trees based on the 'devicetype'
> crush type we'll get a new set of automagic roots that rules can use
> instead.


Regardless of the representation, more smarts about how we discover
these capabilities sounds awesome to me.
>
>
> This doesn't really address the Calamari problem, though... but it would
> solve one of the main use-cases for customizing the map, I think?


You are right about not really addressing calamari. The thing I need
to solve is how to make the ceph-crush-location script smart about
coexisting with changes to the crush map.

Gregory


>
> sage
>
>
>
>
>
>
>
> >
> > John
> >
> > On Fri, Jan 16, 2015 at 2:14 PM, Gregory Meno <gmeno@redhat.com> wrote:
> > > The problem I am trying to solve is:
> > > Calamari now has the ability to manage the crush map and for that to be
> > > useful I need to prevent the default behavior of OSDs set update on start.
> > >
> > > The config surrounding crush_location seems complicated enough that I want
> > > some help deciding on the best approach.
> > >
> > > http://tracker.ceph.com/issues/8667 contains the background info
> > >
> > > options:
> > > - Calamari sets "osd update on start to false" on all OSDs it manages.
> > >
> > > - Calamari sets "osd crush location hook" on all OSDs it manages
> > >
> > > criteria:
> > >
> > > - don't piss off admins with existing clusters and configs
> > >
> > > - solution applies after life-cycle requires addition of new OSDs
> > >
> > > - ??? am I missing more
> > >
> > > comparison:
> > >  TBD
> > >
> > > recommendation:
> > >
> > > after talking to Dan the solution that seems best is:
> > >
> > > Have calamari set "osd crush location hook" to a script that asks either
> > > calamari or the cluster for the OSDs last known location in the CRUSH map if
> > > this is a new OSD fallback to a sensible default e.g. the behavior as it
> > > "osd update on start" were true
> > >
> > > The thing I like most about this approach is that we edit the config file
> > > one time.
> > >
> > > regards,
> > > Gregory
> >
> >


* Re: Fwd: crush_location hook vs calamari
  2015-01-19 16:12       ` Fwd: " Gregory Meno
@ 2015-01-19 16:18         ` Sage Weil
  2015-01-20 16:40           ` Gregory Meno
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-01-19 16:18 UTC (permalink / raw)
  To: Gregory Meno; +Cc: ceph-devel

On Mon, 19 Jan 2015, Gregory Meno wrote:
> On Fri, Jan 16, 2015 at 12:39 PM, Sage Weil <sweil@redhat.com> wrote:
> >
> > [adding ceph-devel, ceph-calamari]
> >
> > On Fri, 16 Jan 2015, John Spray wrote:
> > > Ideally we would have a solution that preserved the OSD hot-plugging
> > > ability (it's a neat feature).
> > >
> > > Perhaps the crush location logic should be:
> > > * If nobody ever overrode me, default behaviour
> > > * If someone (calamari) set an explicit location, preserve that
> > > * UNLESS I am on a different hostname than I was when the explicit
> > > location was set, in which case kick in the hotplug behaviour
> >
> > This would be nice...
> 
> 
> I agree this sounds fine, and easy enough to explain.
> 
> >
> >
> > > The hotplug path might just be to reset my location in the existing
> > > way, or if calamari was really clever it could define how to handle a
> > > hostname change within a different root (typically the 'ssd' root
> > > people create) such that if I unplugged ssd_root->myhost_ssd and
> > > plugged it into foohost, then it would reset its crush location to
> > > ssd_root->foohost_ssd instead of root->foohost.
> > >
> > > We might want to consider adding a flag into the crush map itself so
> > > that nodes can be "locked" to indicate that their location was set by
> > > human intent rather than the crush-location script.
> >
> > Perhaps a per-osd flag in the OSDMap?  We have a field for this right now,
> > although none of the fields are user-modifiable (they are things like 'up'
> > and 'exists').  I think that makes the most sense.
> 
> 
> So If I understand this correctly we are talking about adding data to
> the CRUSH map for the crush-location script to read.
> 
> It appears to not talk to the cluster presently
> ubuntu@vpm148:~$ strace ceph-crush-location --cluster ceph --id 0
> --type osd 2>&1 | grep ^open
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> open("/usr/bin/ceph-crush-location", O_RDONLY) = 3

Yeah, and it should stay that way IMO so that it is a simple hook that 
admins can implement to output the k/v pairs.  I think the smarts should 
go in init-ceph where it calls 'ceph osd crush create-or-move'.  We can 
either add a conditional check in the script there (as well as 
ceph-osd-prestart.sh, which upstart and systemd use), or make a new mon 
command that puts the smarts in the mon.

The former is probably simpler, although slightly racy.  It will probably 
need to do 'ceph osd dump -f json' to parse out the per-osd flags and 
check for something.
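
Something like this in the init script, as a sketch (the
'crush_location_locked' flag name is hypothetical and jq is assumed to be
available):

  # skip the automatic create-or-move if a (hypothetical) per-osd flag is set
  id=0
  locked=$(ceph osd dump -f json |
           jq -r ".osds[] | select(.osd == $id) | .crush_location_locked // empty")
  if [ -z "$locked" ]; then
      loc=$(ceph-crush-location --cluster ceph --id $id --type osd)
      ceph osd crush create-or-move osd.$id 1.0 $loc   # 1.0 stands in for the real weight
  fi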

The latter might be 'ceph osd update-location-on-start <osd.NNN> <k/v 
pairs>'?

> [...]
> You are right about not really addressing calamari. The thing I need
> to solve is how to make ceph-crush-location script smart about
> coexisting with changes to the crush map.

Yep, let's solve that problem first.  :)

sage

> 
> Gregory
> 
> 
> >
> > sage
> >
> >
> >
> >
> >
> >
> >
> > >
> > > John
> > >
> > > On Fri, Jan 16, 2015 at 2:14 PM, Gregory Meno <gmeno@redhat.com> wrote:
> > > > The problem I am trying to solve is:
> > > > Calamari now has the ability to manage the crush map and for that to be
> > > > useful I need to prevent the default behavior of OSDs set update on start.
> > > >
> > > > The config surrounding crush_location seems complicated enough that I want
> > > > some help deciding on the best approach.
> > > >
> > > > http://tracker.ceph.com/issues/8667 contains the background info
> > > >
> > > > options:
> > > > - Calamari sets "osd update on start to false" on all OSDs it manages.
> > > >
> > > > - Calamari sets "osd crush location hook" on all OSDs it manages
> > > >
> > > > criteria:
> > > >
> > > > - don't piss off admins with existing clusters and configs
> > > >
> > > > - solution applies after life-cycle requires addition of new OSDs
> > > >
> > > > - ??? am I missing more
> > > >
> > > > comparison:
> > > >  TBD
> > > >
> > > > recommendation:
> > > >
> > > > after talking to Dan the solution that seems best is:
> > > >
> > > > Have calamari set "osd crush location hook" to a script that asks either
> > > > calamari or the cluster for the OSDs last known location in the CRUSH map if
> > > > this is a new OSD fallback to a sensible default e.g. the behavior as it
> > > > "osd update on start" were true
> > > >
> > > > The thing I like most about this approach is that we edit the config file
> > > > one time.
> > > >
> > > > regards,
> > > > Gregory
> > >
> > >
> 
> 


* Re: Fwd: crush_location hook vs calamari
  2015-01-19 16:18         ` Sage Weil
@ 2015-01-20 16:40           ` Gregory Meno
  2015-01-22 17:18             ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Gregory Meno @ 2015-01-20 16:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-calamari

On Mon, Jan 19, 2015 at 11:18 AM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 19 Jan 2015, Gregory Meno wrote:
>> On Fri, Jan 16, 2015 at 12:39 PM, Sage Weil <sweil@redhat.com> wrote:
>> >
>> > [adding ceph-devel, ceph-calamari]
>> >
>> > On Fri, 16 Jan 2015, John Spray wrote:
>> > > Ideally we would have a solution that preserved the OSD hot-plugging
>> > > ability (it's a neat feature).
>> > >
>> > > Perhaps the crush location logic should be:
>> > > * If nobody ever overrode me, default behaviour
>> > > * If someone (calamari) set an explicit location, preserve that
>> > > * UNLESS I am on a different hostname than I was when the explicit
>> > > location was set, in which case kick in the hotplug behaviour
>> >
>> > This would be nice...
>>
>>
>> I agree this sounds fine, and easy enough to explain.
>>
>> >
>> >
>> > > The hotplug path might just be to reset my location in the existing
>> > > way, or if calamari was really clever it could define how to handle a
>> > > hostname change within a different root (typically the 'ssd' root
>> > > people create) such that if I unplugged ssd_root->myhost_ssd and
>> > > plugged it into foohost, then it would reset its crush location to
>> > > ssd_root->foohost_ssd instead of root->foohost.
>> > >
>> > > We might want to consider adding a flag into the crush map itself so
>> > > that nodes can be "locked" to indicate that their location was set by
>> > > human intent rather than the crush-location script.
>> >
>> > Perhaps a per-osd flag in the OSDMap?  We have a field for this right now,
>> > although none of the fields are user-modifiable (they are things like 'up'
>> > and 'exists').  I think that makes the most sense.
>>
>>
>> So If I understand this correctly we are talking about adding data to
>> the CRUSH map for the crush-location script to read.
>>
>> It appears to not talk to the cluster presently
>> ubuntu@vpm148:~$ strace ceph-crush-location --cluster ceph --id 0
>> --type osd 2>&1 | grep ^open
>> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
>> open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
>> open("/usr/bin/ceph-crush-location", O_RDONLY) = 3
>
> Yeah, and it should stay that way IMO so that it is a simple hook that
> admins can implement to output the k/v pairs.  I think the smarts should
> go in init-ceph where it calls 'ceph osd crush create-or-move'.  We can
> either add a conditional check in the script there (as well as
> ceph-osd-prestart.sh, which upstart and systemd use), or make a new mon
> command that puts the smarts in the mon.
>

+1

> The former is probably simpler, although slightly racy.  It will probably
> need to do 'ceph osd dump -f json' to parse out the per-osd flags and
> check for something.
>
> The latter might be 'ceph osd update-location-on-start <osd.NNN> <k/v
> pairs>'?
>
>> [...]
>> You are right about not really addressing calamari. The thing I need
>> to solve is how to make ceph-crush-location script smart about
>> coexisting with changes to the crush map.
>
> Yep, let's solve that problem first.  :)

So I see solving this problem with Calamari as a precursor to
improving the way this is handled in Ceph.

How does this sound:

When Calamari makes a change to the CRUSH map where an OSD gets
reparented to a different CRUSH tree, it stores a set of key-value
pairs and the physical host in ceph config-key, e.g.

rootA -> hostA -> OSD1, OSD2

becomes

rootA -> hostA -> OSD1

rootB -> hostB -> OSD2

and

ceph config-key get 'calamari:1:osd_crush_location:osd.2' = {'paths':
[[root=rootB, host=hostB]], 'physical_host': hostA}

When the OSD starts up, a calamari-specific script sends a mon command
to get the data we persisted in the config-key. If none exists we
return the default crush_path; otherwise, if the physical_host matches
the node where this OSD is starting, we return the stored path. If
the host match fails we return the default crush_path so that
hot-plugging continues to work.

and Calamari sets "osd crush location hook" on all OSDs it manages.
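
A rough sketch of what that hook could look like (assuming the config-key
value is stored as real JSON and that jq is available on the OSD host):

  # calamari-aware crush location hook for one OSD
  id=2
  key="calamari:1:osd_crush_location:osd.$id"
  default_loc="root=default host=$(hostname -s)"
  stored=$(ceph config-key get "$key" 2>/dev/null)
  if [ -z "$stored" ]; then
      echo "$default_loc"                                  # new OSD: default behaviour
  elif [ "$(echo "$stored" | jq -r .physical_host)" = "$(hostname -s)" ]; then
      echo "$stored" | jq -r '.paths[0] | join(" ")'       # same host: keep stored path
  else
      echo "$default_loc"                                  # moved: let hot-plugging kick in
  fi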

Gregory


>
> sage
>
>>
>> Gregory
>>
>>
>> >
>> > sage
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > >
>> > > John
>> > >
>> > > On Fri, Jan 16, 2015 at 2:14 PM, Gregory Meno <gmeno@redhat.com> wrote:
>> > > > The problem I am trying to solve is:
>> > > > Calamari now has the ability to manage the crush map and for that to be
>> > > > useful I need to prevent the default behavior of OSDs set update on start.
>> > > >
>> > > > The config surrounding crush_location seems complicated enough that I want
>> > > > some help deciding on the best approach.
>> > > >
>> > > > http://tracker.ceph.com/issues/8667 contains the background info
>> > > >
>> > > > options:
>> > > > - Calamari sets "osd update on start to false" on all OSDs it manages.
>> > > >
>> > > > - Calamari sets "osd crush location hook" on all OSDs it manages
>> > > >
>> > > > criteria:
>> > > >
>> > > > - don't piss off admins with existing clusters and configs
>> > > >
>> > > > - solution applies after life-cycle requires addition of new OSDs
>> > > >
>> > > > - ??? am I missing more
>> > > >
>> > > > comparison:
>> > > >  TBD
>> > > >
>> > > > recommendation:
>> > > >
>> > > > after talking to Dan the solution that seems best is:
>> > > >
>> > > > Have calamari set "osd crush location hook" to a script that asks either
>> > > > calamari or the cluster for the OSDs last known location in the CRUSH map if
>> > > > this is a new OSD fallback to a sensible default e.g. the behavior as it
>> > > > "osd update on start" were true
>> > > >
>> > > > The thing I like most about this approach is that we edit the config file
>> > > > one time.
>> > > >
>> > > > regards,
>> > > > Gregory
>> > >
>> > >
>>
>>


* Re: Fwd: crush_location hook vs calamari
  2015-01-20 16:40           ` Gregory Meno
@ 2015-01-22 17:18             ` Sage Weil
  2015-01-22 17:28               ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-01-22 17:18 UTC (permalink / raw)
  To: Gregory Meno; +Cc: ceph-devel, ceph-calamari

On Tue, 20 Jan 2015, Gregory Meno wrote:
> >> [...]
> >> You are right about not really addressing calamari. The thing I need
> >> to solve is how to make ceph-crush-location script smart about
> >> coexisting with changes to the crush map.
> >
> > Yep, let's solve that problem first.  :)
> 
> So I see solving this problem with Calamari is a precursor to
> improving the way this is handled in Ceph.
> 
> How does this sound:
> 
> When Calamari makes a change to the CRUSH map where an OSD gets
> reparented to a different CRUSH tree  it stores a set of key-value
> pairs and physical host in ceph config-key e.g.
> 
> rootA -> hostA -> OSD1, OSD2
> 
> becomes
> 
> rootA -> hostA -> OSD1
> 
> rootB -> hostB -> OSD2
> 
> and
> 
> ceph config-key get 'calamari:1:osd_crush_location:osd.2' = {'paths':
> [[root=rootB, host=hostB]], 'physical_host': hostA}
> 
> When the OSD starts up, a calamari-specific script sends a mon command
> to get the data we persisted in the config-key. If none exists we
> return the default crush_path; otherwise, if the physical_host matches
> the node where this OSD is starting, we return the stored path. If
> the host match fails we return the default crush_path so that
> hot-plugging continues to work.
> 
> and Calamari sets "osd crush location hook" on all OSDs it manages

Hmm, with that logic, I think what we have now will actually work 
unmodified?  If the *actual* crush location is, say,

 root=a rack=b host=c

and the hook says

 root=foo rack=bb host=c

it will make no change. It looks for the innermost (by crush type id) 
field and if it matches it's a no-op.  OTOH if the hook says

 root=foo rack=bb host=cc

then it will move it to a new location.  Again, though, we start with the 
innermost fields and stop once there is a match.  So if rack=bb exists but 
under root=bar, we will end up with

 root=bar rack=bb host=cc

because we stop at the first item that is already present 
(rack=bb).

Mainly this means that if we move a host to a new rack the OSDs won't move 
themselves around... the admin needs to adjust the crush map explicitly.

Anyway, does that look right?

...

If that *doesn't* work, it brings up a couple questions, though...

1) Should this be a 'calamari' override or a generic ceph one?  It could 
go straight into the default hook.  That would simplify things.

2) I have some doubts about whether the crush location update via the init 
script is a good idea.  I have a half-finished patch that moves this step 
into the OSD itself so that the init script doesn't block when the mons 
are down; instead, ceph-osd will start (and maybe fork) as usual and then 
retry until the mons become available, do the crush update, and then do 
the rest of its boot sequence.  We also avoid duplicating the 
implementation in the sysvinit script and upstart/systemd helper (which 
IIRC is somewhat awkward to trigger, the original motivation for this 
patch).

sage


* Re: Fwd: crush_location hook vs calamari
  2015-01-22 17:18             ` Sage Weil
@ 2015-01-22 17:28               ` Sage Weil
  0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2015-01-22 17:28 UTC (permalink / raw)
  To: Gregory Meno; +Cc: ceph-devel, ceph-calamari

On Thu, 22 Jan 2015, Sage Weil wrote:
> 2) I have some doubts about whether the crush location update via the init 
> script is a good idea.  I have a half-finished patch that moves this step 
> into the OSD itself so that the init script doesn't block when the mons 
> are down; instead, ceph-osd will start (and maybe fork) as usual and then 
> retry until the mons become available, do the crush update, and then do 
> the rest of its boot sequence.  We also avoid duplicating the 
> implementation in the sysvinit script and upstart/systemd helper (which 
> IIRC is somewhat awkward to trigger, the original motivation for this 
> patch).

Nevermind, I remember why this didn't get very far... the OSD works off the 
crush_location option, but that still needs to be filled in by the hook, so 
either the init system needs to do a --crush-location `...` deal (meh) or 
ceph-osd has to call the hook directly (meh).

sage


