* Braindump: path names, partition labels, FHS, auto-discovery
@ 2012-03-06 21:19 Tommi Virtanen
  2012-03-06 22:29 ` Greg Farnum
  2012-03-07  9:55 ` David McBride
  0 siblings, 2 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-06 21:19 UTC (permalink / raw)
  To: ceph-devel

As you may have noticed, the docs [1] and Chef cookbooks [2] currently
use /srv/osd.$id and such paths. That's, shall we say, Not Ideal(tm).

[1] http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#creating-a-ceph-conf-file
[2] https://github.com/ceph/ceph-cookbooks/blob/master/ceph/recipes/bootstrap_osd.rb#L70


I initially used /srv purely because I needed to get them going quickly,
and that directory was guaranteed to exist. Let's figure out the long-term
goal.

The kinds of things we have:

- configuration, edited by humans (ONLY)
- machine-editable state similar to configuration
- OSD data is typically a dedicated filesystem, accommodate that
- OSD journal can be just about any file, including block devices

OSD journal flexibility is limiting for automation; we should support three
major use cases:

- OSD journal may be fixed-basename file inside osd data directory
- OSD journal may be a file on a shared SSD
- OSD journal may be a block device (e.g. full SSD, partition on SSD,
2nd LUN on the same RAID with different tuning)

Requirements:

- FHS compliant: http://www.pathname.com/fhs/
- works well with Debian and RPM packaging
- OSD creation/teardown is completely automated
- ceph.conf is static for the whole cluster; not edited by per-machine
automation
- we're assuming GPT partitions, at least for now

Desirable things:

- ability to isolate daemons from each other more, e.g.
AppArmor/SELinux/different uids; e.g. do not assume all daemons can
mkdir in the same directory (ceph-mon vs ceph-osd)
- ability to move OSD data disk from server A to server B (e.g.
chassis swap due to faulty mother board)


The Plan (ta-daaa!):

(These will be just the defaults -- if you're hand-rolling your setup,
and disagree, just override them.)

(Apologies if this gets sketchy, I haven't had time to distill these
thoughts into something prettier.)

- FHS says human-editable configuration goes in /etc
- FHS says machine-editable state goes in /var/lib/ceph
- use /var/lib/ceph/mon/$id/ for mon.$id
- use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
actual location
- use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
actual location?
- embed the same random UUID in osd data & osd journal at ceph-osd
mkfs time, for safety
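
To make those defaults concrete, here is a rough sketch of creating osd.12
by hand under this layout. Purely illustrative: the partition labels, the
use of /dev/disk/by-partlabel, and the exact ceph-osd flags are assumptions
layered on the plan above, not existing tooling.

  # hypothetical sketch; osd id 12 and the device labels are placeholders
  id=12
  mkdir -p /var/lib/ceph/osd-data/$id /var/lib/ceph/osd-journal
  mount /dev/disk/by-partlabel/ceph-osd-data-$id /var/lib/ceph/osd-data/$id
  ln -s /dev/disk/by-partlabel/ceph-osd-journal-$id /var/lib/ceph/osd-journal/$id
  ceph-osd -i $id --mkfs \
      --osd-data /var/lib/ceph/osd-data/$id \
      --osd-journal /var/lib/ceph/osd-journal/$id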

On a disk hot plug event (and at bootup):
- found = {}
- scan the partitions for partition label with the prefix
"ceph-osd-data-". Take the remaining portion as $id and mount the fs
in /var/lib/ceph/osd-data/$id. Add $id to found (TODO handle
pre-existing). if osd-data/$id/journal exists, symlink osd-journal/$id
to it (TODO handle pre-existing).
- scan for partition label with the prefix "ceph-osd-journal-" and
special GUID type. Take the remaining portion as $id and symlink the
block device to /var/lib/ceph/osd-journal/$id. Add $id to found. (TODO
handle pre-existing)
- for each $id in found, if we have both osd-journal and osd-data,
start a ceph-osd for it
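
Roughly, the hook could look like the following shell sketch. This is
illustrative only; it assumes udev exposes the GPT labels under
/dev/disk/by-partlabel and it glosses over the pre-existing/duplicate
handling flagged as TODO above.

  #!/bin/sh
  # sketch of the discovery pass described above; not existing tooling
  mkdir -p /var/lib/ceph/osd-data /var/lib/ceph/osd-journal
  for dev in /dev/disk/by-partlabel/ceph-osd-data-*; do
      [ -e "$dev" ] || continue
      id=${dev##*-}
      mkdir -p /var/lib/ceph/osd-data/$id
      mountpoint -q /var/lib/ceph/osd-data/$id || mount "$dev" /var/lib/ceph/osd-data/$id
      # a journal embedded in the data dir is used unless a dedicated one shows up
      [ -e /var/lib/ceph/osd-data/$id/journal ] && \
          ln -sfn /var/lib/ceph/osd-data/$id/journal /var/lib/ceph/osd-journal/$id
  done
  for dev in /dev/disk/by-partlabel/ceph-osd-journal-*; do
      [ -e "$dev" ] || continue
      ln -sfn "$dev" /var/lib/ceph/osd-journal/${dev##*-}
  done
  # start a ceph-osd for every id that ended up with both data and journal
  for data in /var/lib/ceph/osd-data/*; do
      id=${data##*/}
      [ -e /var/lib/ceph/osd-journal/$id ] && ceph-osd -i "$id"
  done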


Moving journal

As an admin, I want to move an OSD data disk from one physical host
(chassis) to another (e.g. for maintenance of non-hotswap power
supply).
I might have a single SSD, divided into multiple partitions, each
acting as the journal for a single OSD data disk. I want to spread the
load evenly across the rest of the cluster, so I move the OSD data
disks to multiple destination machines, as long as they have 1 slot
free. Naturally, I cannot easily saw the SSD apart and move it
physically.

I would like to be able to:

1. shut down the osd daemon
2. explicitly flush out & invalidate the journal on SSD (after this,
the journal would not be marked with the osd id and fsid anymore)
3. move the HDD
4. on the new host, assign a blank SSD partition and initialize it
with the right fsid etc metadata

It may actually be nicer to think of this as:

1. shut down the osd daemon
2. move the journal inside the osd data dir, invalidate the old one
(flushing it is an optimization)
3. physically move the HDD
4. move the journal from inside the osd data dir to assigned block device
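
A hedged sketch of that flow in commands (the --flush-journal/--mkjournal
spellings and the --osd-journal override are from memory, so verify against
your build; the /etc/init.d/ceph syntax is the one discussed later in this
thread; the id and device paths are placeholders):

  id=12
  /etc/init.d/ceph stop osd.$id                          # 1. shut down the osd daemon
  ceph-osd -i $id --flush-journal                        # 2. drain the external journal...
  ceph-osd -i $id --mkjournal \
      --osd-journal /var/lib/ceph/osd-data/$id/journal   #    ...and recreate it inside the data dir
  # 3. physically move the HDD to the new host, then there:
  rm /var/lib/ceph/osd-data/$id/journal                  # 4. drop the embedded (still empty) journal...
  ceph-osd -i $id --mkjournal \
      --osd-journal /dev/disk/by-partlabel/ceph-osd-journal-$id   #    ...and recreate it on the assigned device
  /etc/init.d/ceph start osd.$id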


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
@ 2012-03-06 22:29 ` Greg Farnum
  2012-03-07  9:55 ` David McBride
  1 sibling, 0 replies; 11+ messages in thread
From: Greg Farnum @ 2012-03-06 22:29 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tuesday, March 6, 2012 at 1:19 PM, Tommi Virtanen wrote:
> As you may have noticed, the docs [1] and Chef cookbooks [2] currently
> use /srv/osd.$id and such paths. That's, shall we say, Not Ideal(tm).
> 
> [1] http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#creating-a-ceph-conf-file
> [2] https://github.com/ceph/ceph-cookbooks/blob/master/ceph/recipes/bootstrap_osd.rb#L70
> 
> 
> I initially used /srv purely because I needed to get them going quick,
> and that directory was guaranteed to exist. Let's figure out the long
> term goal.
> 
> The kinds of things we have:
> 
> - configuration, edited by humans (ONLY)
> - machine-editable state similar to configuration
> - OSD data is typically a dedicated filesystem, accommodate that
> - OSD journal can be just about any file, including block devices
> 
> OSD journal flexibility is limiting for automation.. support three
> major use cases:
> 
> - OSD journal may be fixed-basename file inside osd data directory
> - OSD journal may be a file on a shared SSD
> - OSD journal may be a block device (e.g. full SSD, partition on SSD,
> 2nd LUN on the same RAID with different tuning)
> 
> Requirements:
> 
> - FHS compliant: http://www.pathname.com/fhs/
> - works well with Debian and RPM packaging
> - OSD creation/teardown is completely automated
> - ceph.conf is static for the whole cluster; not edited by per-machine
> automation
> - we're assuming GPT partitions, at least for now
> 
> Desirable things:
> 
> - ability to isolate daemons from each other more, e.g.
> AppArmor/SELinux/different uids; e.g. do not assume all daemons can
> mkdir in the same directory (ceph-mon vs ceph-osd)
> - ability to move OSD data disk from server A to server B (e.g.
> chassis swap due to faulty mother board)
> 
> 
> The Plan (ta-daaa!):
> 
> (These will be just the defaults -- if you're hand-rolling your setup,
> and disagree, just override them.)
> 
> (Apologies if this gets sketchy, I haven't had time to distill these
> thoughts into something prettier.)
> 
> - FHS says human-editable configuration goes in /etc
> - FHS says machine-editable state goes in /var/lib/ceph
> - use /var/lib/ceph/mon/$id/ for mon.$id
> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> actual location
> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> actual location?
> - embed the same random UUID in osd data & osd journal at ceph-osd
> mkfs time, for safety
> 
> On a disk hot plug event (and at bootup):
> - found = {}
> - scan the partitions for partition label with the prefix
> "ceph-osd-data-". Take the remaining portion as $id and mount the fs
> in /var/lib/ceph/osd-data/$id. Add $id to found (TODO handle
> pre-existing). if osd-data/$id/journal exists, symlink osd-journal/$id
> to it (TODO handle pre-existing).
> - scan for partition label with the prefix "ceph-osd-journal-" and
> special GUID type. Take the remaining portion as $id and symlink the
> block device to /var/lib/ceph/osd-journal/$id. Add $id to found. (TODO
> handle pre-existing)
> - for each $id in found, if we have both osd-journal and osd-data,
> start a ceph-osd for it
> 
> 
> Moving journal
> 
> As an admin, I want to move an OSD data disk from one physical host
> (chassis) to another (e.g. for maintenance of non-hotswap power
> supply).
> I might have a single SSD, divided into multiple partitions, each
> acting as the journal for a single OSD data disk. I want to spread the
> load evenly across the rest of the cluster, so I move the OSD data
> disks to multiple destination machines, as long as they have 1 slot
> free. Naturally, I cannot easily saw the SSD apart and move it
> physically.
> 
> I would like to be able to:
> 
> 1. shut down the osd daemon
> 2. explicitly flush out & invalidate the journal on SSD (after this,
> the journal would not be marked with the osd id and fsid anymore)
> 3. move the HDD
> 4. on the new host, assign a blank SSD partition and initialize it
> with the right fsid etc metadata

I have no thoughts on the rest of it, but I believe what you're asking for here is the existing
ceph-osd --flushjournal
Although this doesn't invalidate the existing journal (now, at least), it will let you do prototyping without
much difficulty. :)
-Greg

 
> 
> It may actually be nicer to think of this as:
> 
> 1. shut down the osd daemon
> 2. move the journal inside the osd data dir, invalidate the old one
> (flushing it is an optimization)
> 3. physically move the HDD
> 4. move the journal from inside the osd data dir to assigned block device





* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
  2012-03-06 22:29 ` Greg Farnum
@ 2012-03-07  9:55 ` David McBride
  2012-03-07 20:54   ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: David McBride @ 2012-03-07  9:55 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:

> - scan the partitions for partition label with the prefix
> "ceph-osd-data-".

Thought: I'd consider not using a numbered partition label as the
primary identifier for an OSD.

There are failure modes that can occur, for example, if you have disks
from multiple different Ceph clusters accessible to a given host, or if
you have a partially failed OSD disk (or a historical copy of one)
accessible at the same time as the current instance.

(Though you might reasonably rule these cases as out-of-scope.)

To make handling cases like these straightforward, I suspect Ceph may
want to use something functionally equivalent to an MD superblock --
though in practice, with an OSD, this could simply be a file containing
the appropriate meta-data.

In fact, I imagine that the OSDs could already contain the necessary
fields -- a reference to their parent cluster's UUID, to ensure foreign
volumes aren't mistakenly mounted; something like mdadm's event-counters
to distinguish between current/historical versions of the same OSD.
(Configuration epoch-count?); a UUID reference to that OSD's journal
file, etc.
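
For illustration only, such a "superblock" really could just be a small
key/value file in the OSD data directory. The file name and field names
below are made up for the sake of the example, not Ceph's actual on-disk
metadata:

  # hypothetical contents of /var/lib/ceph/osd-data/12/superblock
  # (file name, fields and values are all placeholders)
  cluster_uuid: 11111111-2222-3333-4444-555555555555
  osd_uuid:     aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  journal_uuid: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  epoch:        42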

 - - -

Perhaps related to this, I've been looking to determine whether it's
feasible to build and configure a Ceph cluster incrementally -- building
an initial cluster containing just a single MON node, and then piecewise
adding additional OSDs / MDSs / MONs to build up to the full-set.

In part, this is so that the processes for initially setting up the
cluster and for expanding the cluster once it's in operation are
identical.  But this is also to avoid needing to hand-maintain a
configuration file, replicated across all hosts, that enumerates all of
the different cluster elements -- replicating a function already handled
better by the MON elements.

I can almost see the ceph.conf file only being used at cluster
initialization-time, then discarded in favour of run-time commands that
update the live cluster state.

Is this practical?  (Or even desirable?)

Cheers,
David
-- 
David McBride <dwm@doc.ic.ac.uk>
Department of Computing, Imperial College, London



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-07  9:55 ` David McBride
@ 2012-03-07 20:54   ` Sage Weil
  2012-03-19  9:21     ` Bernard Grymonpon
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-03-07 20:54 UTC (permalink / raw)
  To: David McBride; +Cc: Tommi Virtanen, ceph-devel

On Wed, 7 Mar 2012, David McBride wrote:
> On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> 
> > - scan the partitions for partition label with the prefix
> > "ceph-osd-data-".
> 
> Thought: I'd consider not using a numbered partition label as the
> primary identifier for an OSD.
> 
> There are failure modes that can occur, for example, if you have disks
> from multiple different Ceph clusters accessible to a given host, or if
> you have a partially failed (or historical copy) of an OSD disk
> accessible at the same time as a current instance.
> 
> (Though you might reasonably rule these cases as out-of-scope.)
> 
> To make handling cases like these straightforward, I suspect Ceph may
> want to use something functionally equivalent to an MD superblock --
> though in practice, with an OSD, this could simply be a file containing
> the appropriate meta-data.
> 
> In fact, I imagine that the OSDs could already contain the necessary
> fields -- a reference to their parent cluster's UUID, to ensure foreign
> volumes aren't mistakenly mounted; something like mdadm's event-counters
> to distinguish between current/historical versions of the same OSD.
> (Configuration epoch-count?); a UUID reference to that OSD's journal
> file, etc.

We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
data dir and in the journal, so you know that they go together.  

I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
versioning and so forth... are you imagining a duplicate/backup instance 
of an osd drive getting plugged in or something?  We don't guard for 
that, but I'm not sure offhand how we would.  :/

Anyway, I suspect the missing piece here is to incorporate the uuids into 
the path names somehow.  

TV wrote:
> - FHS says human-editable configuration goes in /etc
> - FHS says machine-editable state goes in /var/lib/ceph
> - use /var/lib/ceph/mon/$id/ for mon.$id
> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> actual location
> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> actual location?

I wonder if these should be something like

 /var/lib/ceph/$cluster_uuid/mon/$id
 /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
 /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id

so that cluster instances don't stomp on one another.  OTOH, that would 
imply that we should do something like

 /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.

too.


>  - - -
> 
> Perhaps related to this, I've been looking to determine whether it's
> feasible to build and configure a Ceph cluster incrementally -- building
> an initial cluster containing just a single MON node, and then piecewise
> adding additional OSDs / MDSs / MONs to build up to the full-set.
> 
> In part, this is so that the processes for initially setting up the
> cluster and for expanding the cluster once its in operation are
> identical.  But this is also to avoid needing to hand-maintain a
> configuration file, replicated across all hosts, that enumerates all of
> the different cluster elements -- replicating a function already handled
> better by the MON elements.
> 
> I can almost see the ceph.conf file only being used at cluster
> initialization-time, then discarded in favour of run-time commands that
> update the live cluster state.
> 
> Is this practical?  (Or even desirable?)

This is exactly what the eventual chef/juju/etc building blocks will do.  
The tricky part is really the monitor cluster bootstrap (because you may 
have 3 of them coming up in parallel, and they need to form an initial 
quorum in a safe/sane way).  Once that happens, expanding the cluster is 
pretty mechanical.

The goal is to provide building blocks (simple scripts, hooks, whatever) 
for doing things like mapping a new block device to the proper location, 
starting up the appropriate ceph-osd, initializing/labeling a new device, 
creating a new ceph-osd on it and adding it to the cluster, etc.  The 
chef/juju/whatever scripts would then build on the common set of tools.

Most of the pieces are worked out in TV's head or mine, but we haven't had 
time to put it all together.  First we need to get our new qa hardware 
online..

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-07 20:54   ` Sage Weil
@ 2012-03-19  9:21     ` Bernard Grymonpon
  2012-03-20  7:25       ` Sage Weil
  2012-03-27 18:21       ` Tommi Virtanen
  0 siblings, 2 replies; 11+ messages in thread
From: Bernard Grymonpon @ 2012-03-19  9:21 UTC (permalink / raw)
  To: ceph-devel

Sage Weil <sage <at> newdream.net> writes:

> 
> On Wed, 7 Mar 2012, David McBride wrote:
> > On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> > 
> > > - scan the partitions for partition label with the prefix
> > > "ceph-osd-data-".
> > 
> > Thought: I'd consider not using a numbered partition label as the
> > primary identifier for an OSD.
> > 

<snip>

> > To make handling cases like these straightforward, I suspect Ceph may
> > want to use something functionally equivalent to an MD superblock --
> > though in practice, with an OSD, this could simply be a file containing
> > the appropriate meta-data.
> > 
> > In fact, I imagine that the OSDs could already contain the necessary
> > fields -- a reference to their parent cluster's UUID, to ensure foreign
> > volumes aren't mistakenly mounted; something like mdadm's event-counters
> > to distinguish between current/historical versions of the same OSD.
> > (Configuration epoch-count?); a UUID reference to that OSD's journal
> > file, etc.
> 
> We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
> gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
> data dir and in the journal, so you know that they go together.  
> 
> I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
> versioning and so forth... are you imagining a duplicate/backup instance 
> of an osd drive getting plugged in or something?  We don't guard for 
> that, but I'm not sure offhand how we would.  :/
> 
> Anyway, I suspect the missing piece here is to incorporate the uuids into 
> the path names somehow.  

I would discourage using the disk labels, as you might not always be able to
set these (consider imported LUNs from other storage boxes, or internal
regulations on labeling disks...). I would trust the sysadmin to know which
mounts go where to get everything in place (he can use the labels in his
fstab or some clever boot script himself), and then use the ceph metadata to
start only "sane" OSDs/MONs/...

In my opinion, an OSD should be able to figure out by itself whether it has a
"good" dataset to "boot" with - and it is up to the mon to either accept or
reject this OSD as a good/valid part of the cluster, or decide that it needs
re-syncing.

> TV wrote:
> > - FHS says human-editable configuration goes in /etc
> > - FHS says machine-editable state goes in /var/lib/ceph
> > - use /var/lib/ceph/mon/$id/ for mon.$id
> > - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> > actual location
> > - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> > actual location?
> 
> I wonder if these should be something like
> 
>  /var/lib/ceph/$cluster_uuid/mon/$id
>  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
>  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id

The numbering of the MONs/OSDs is a bit of a hassle now; best would be (in my
opinion):

/var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
/var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
/var/lib/ceph/$cluster_uuid/mon/$mon_uuid/

Journal and data go together for the OSD - so no need to split these at a
lower level. One can't have an OSD without both, so it seems fair to put them
next to each other...


> so that cluster instances don't stomp on one another.  OTOH, that would 
> imply that we should do something like
> 
>  /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.

Ack, although at cluster creation the cluster_uuid is unknown, which creates a
bit of a chicken-and-egg situation.



> > Perhaps related to this, I've been looking to determine whether it's
> > feasible to build and configure a Ceph cluster incrementally -- building
> > an initial cluster containing just a single MON node, and then piecewise
> > adding additional OSDs / MDSs / MONs to build up to the full-set.

This would be ideal - especially for use in chef (and probably other
deployment automation tools).

> > 
> > In part, this is so that the processes for initially setting up the
> > cluster and for expanding the cluster once its in operation are
> > identical.  But this is also to avoid needing to hand-maintain a
> > configuration file, replicated across all hosts, that enumerates all of
> > the different cluster elements -- replicating a function already handled
> > better by the MON elements.
> > 
> > I can almost see the ceph.conf file only being used at cluster
> > initialization-time, then discarded in favour of run-time commands that
> > update the live cluster state.
> > 
> > Is this practical?  (Or even desirable?)
> 
> This is exactly what the eventual chef/juju/etc building blocks will do.  
> The tricky part is really the monitor cluster bootstrap (because you may 
> have 3 of them coming up in parallel, and they need to form an initial 
> quorum in a safe/sane way).  Once that happens, expanding the cluster is 
> pretty mechanical.
> 
> The goal is to provide building blocks (simple scripts, hooks, whatever) 
> for doing things like mapping a new block device to the proper location, 
> starting up the appropriate ceph-osd, initializing/labeling a new device, 
> creating a new ceph-osd on it and adding it to the cluster, etc.  The 
> chef/juju/whatever scripts would then build on the common set of tools.
> 
> Most of the pieces are worked out in TV's head or mine, but we haven't had 
> time to put it all together.  First we need to get our new qa hardware 
> online..

As I've been constructing some cookbooks to set up a default cluster, this is
what I bumped into:

- the numbering (0, 1, ...) of the OSDs and their need to keep the same number
  throughout the lifetime of the cluster is a bit of a hassle. Each OSD needs
  to have a complete view of all the components of the cluster before it can
  determine its own ID. A random, auto-generated UUID would be nice (I
  currently solved this by assigning each cluster a global "clustername",
  searching the chef server for all nodes, looking for the highest indexed
  OSD, and incrementing this to determine the new OSD's index - there must be
  a better way).

- the configfile needs to be the same on all hosts - which is only partially
  true. From my point of view, an OSD should only have some way of contacting
  one mon, which would inform the OSD of the cluster layout. So, only the
  mon info should be there (together with the info for the OSD itself,
  obviously)

- there is a chicken-and-egg problem in the authentication of an OSD to the
  mon. An OSD should have permission to join the mon, for which we need to
  add the OSD to the mon. As chef works on the node, and can't trigger stuff
  on other nodes, the node that will hold the OSD needs some way of
  authenticating itself to the mon (I solved this by storing the
  "client.admin" secret on the mon node, then pulling it from there on the
  osd node and using it to register myself with the mon. It is like putting a
  copy of your house key on your front door...). I see no obvious solution
  here.

- the current (debian) start/stop scripts are a hassle to work with, as chef
  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
  mon / osd / ... should have its own start/stop script.

- there should be some way to ask a local running OSD/MON for its status,
  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
  --uuid=xxx --type=mon status", which would inform us if it is running,
  healthy, part of the cluster, lost in space...

- growing the cluster bit by bit would be ideal; this is how chef works (it
  handles nodes one by one, not a bunch of nodes in one go)

- ideally, there would be an automatic crushmap-expansion command which would
  add a device to an existing crushmap (or remove one). Now, the crushmap
  needs to be reconstructed completely, and if your numbering changes somehow,
  you're screwed. Ideal would be "take the current crushmap and add OSD with
  uuid xxx" - "take the current crushmap and remove OSD xxx"

Just my thoughts! I've been following the ceph project for a while now, set up
a couple of test clusters in the past and over the last two weeks, and made
the cookbooks to make my life easier (and bumped into a lot of ops trouble
doing this...).

Rgds,
Bernard







* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-19  9:21     ` Bernard Grymonpon
@ 2012-03-20  7:25       ` Sage Weil
  2012-03-20  7:55         ` Bernard Grymonpon
  2012-03-27 11:29         ` David McBride
  2012-03-27 18:21       ` Tommi Virtanen
  1 sibling, 2 replies; 11+ messages in thread
From: Sage Weil @ 2012-03-20  7:25 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Mon, 19 Mar 2012, Bernard Grymonpon wrote:
> Sage Weil <sage <at> newdream.net> writes:
> 
> > 
> > On Wed, 7 Mar 2012, David McBride wrote:
> > > On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> > > 
> > > > - scan the partitions for partition label with the prefix
> > > > "ceph-osd-data-".
> > > 
> > > Thought: I'd consider not using a numbered partition label as the
> > > primary identifier for an OSD.
> > > 
> 
> <snip>
> 
> > > To make handling cases like these straightforward, I suspect Ceph may
> > > want to use something functionally equivalent to an MD superblock --
> > > though in practice, with an OSD, this could simply be a file containing
> > > the appropriate meta-data.
> > > 
> > > In fact, I imagine that the OSDs could already contain the necessary
> > > fields -- a reference to their parent cluster's UUID, to ensure foreign
> > > volumes aren't mistakenly mounted; something like mdadm's event-counters
> > > to distinguish between current/historical versions of the same OSD.
> > > (Configuration epoch-count?); a UUID reference to that OSD's journal
> > > file, etc.
> > 
> > We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
> > gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
> > data dir and in the journal, so you know that they go together.  
> > 
> > I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
> > versioning and so forth... are you imagining a duplicate/backup instance 
> > of an osd drive getting plugged in or something?  We don't guard for 
> > that, but I'm not sure offhand how we would.  :/
> > 
> > Anyway, I suspect the missing piece here is to incorporate the uuids into 
> > the path names somehow.  
> 
> I would discourage using the disk-labels, as you might not always be able to
> set these (consider imported luns from other storage boxes, or some internal
> regulations in labeling disks...). I would trust the sysadmin to know which
> mounts go where to get everything in place (he himself can use the labels in
> his fstab or some clever bootscript), and then use the ceph-metadata to start
> only "sane" OSDs/MONs/...

The goal is to make this optional.  I.e., provide tools to use via udev 
to mount disks in good locations based on labels, but not require them if 
the sysadmin has some other idea about how it should be done.

Ideally, the start/stop scripts should be able to look in /var/lib/ceph 
and start daemons for whatever it sees there that looks sane.

> In my opinion, a OSD should be able to figure out himself if he has a "good"
> dataset to "boot" with - and it is up to the mon to either reject or accept
> this OSD as a good/valid part of the cluster, or if it needs re-syncing.

Yes.
 
> > TV wrote:
> > > - FHS says human-editable configuration goes in /etc
> > > - FHS says machine-editable state goes in /var/lib/ceph
> > > - use /var/lib/ceph/mon/$id/ for mon.$id
> > > - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> > > actual location
> > > - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> > > actual location?
> > 
> > I wonder if these should be something like
> > 
> >  /var/lib/ceph/$cluster_uuid/mon/$id
> >  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
> >  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id
> 
> The numbering of the MON/OSD's is a bit a hassle now, best would be (in my
> opinion)
> 
> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
> /var/lib/ceph/$cluster_uuid/osd/$mon_uuid/
> 
> Journal and data go together for the OSD - so no need to split these on a
> lower level. One can't have a OSD without both, so seems fair to put them next
> to each other...

Currently the ceph-osd is told which id to be on startup; the only real 
shift here would be to let you specify some uuids instead and have it pull 
its rank (id) out of the .../whoami file.

Monitors have user-friendly names ('foo', `hostname`).  We could add uuids 
there too, but I'm less sure how useful that'll be.

> > so that cluster instances don't stomp on one another.  OTOH, that would 
> > imply that we should do something like
> > 
> >  /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.
> 
> Ack, although at cluster creation, the cluster_uuid is unknown, which kind of
> gives a chicken-egg situation.

Making the mkfs process take the cluster_uuid as input is easy, although 
it makes it possible for a bad sysadmin to share a uuid across clusters.


> As I've been constructing some cookbooks to setup a default cluster, this is
> what I bumped into:
> 
> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>   throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>   have a complete view of all the components of the cluster before it can
>   determine it's own ID. A random, auto-generated UUID would be nice (I
>   currently solved this by assigning each cluster a global "clustername", and
>   search the chef server for all nodes, look for the highest indexed OSDs, and
>   increment this to determine the new OSD's index - there must be a better
>   way).

The 'ceph osd create' command will handle the allocation of a new unique 
id for you.  We could supplement that with a uuid to make it a bit more 
robust (if we add the osd uuids to the osdmap... probably a good idea 
anyway).
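
For instance, a cookbook can capture the newly allocated id directly
(sketch; assumes a suitably privileged key is available on the node):

  # 'ceph osd create' prints the id it allocated on stdout
  osd_id=$(ceph osd create)
  echo "allocated osd.$osd_id"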

> - the configfile needs to be the same on all hosts - which is only partially
>   true. From my point of view, a OSD should only have some way of contacting
>   one mon, which would inform the OSD of the cluster layout. So, only the
>   mon-info should be there (together with the info for the OSD itself,
>   obviously)

It doesn't, actually; it's only needed to bootstrap (to find the monitor(s) 
on startup) and to set any config values that are non-default.  The 
current start/stop script wants to see the local instances there, but that 
can be replaced by looking for directories in /var/lib/ceph/.

> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>   OSD should have permission to join the mon, for which we need to add the OSD
>   to the mon. As chef works on the node, and can't trigger stuff on other
>   nodes, the node that will hold the OSD needs some way of authenticating
>   itself to the mon (I solved this by storing the "client.admin" secret on the
>   mon-node, and then pulling this from there on the osd node, and using it to
>   register myself to the mon. It is like putting a copy of your homekey on
>   your front door...). I see no obvious solution here.

We've set up a special key that has permission to create new osds only, 
but again it's pretty bad security.  Chef's model just doesn't work well 
here.

> - the current (debian) start/stop scripts are a hassle to work with, as chef
>   doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>   mon / osd / ... should have its own start/stop script.
> 
> - there should be some way to ask a local running OSD/MON for its status,
>   without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>   --uuid=xxx --type=mon status", which would inform us if it is running,
>   healthy, part of the cluster, lost in space...

Each daemon has a socket in /var/run/ceph to communicate with it; adding a 
health command would be pretty straightforward.
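
For example (the socket file name and the command set vary by version;
'help' lists what a given daemon actually supports):

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version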

> - growing the cluster bit by bit would be ideal, this is how chef works (it
>   handles node per node, not a bunch of nodes in one go) 

This works now, with the exception of monitor cluster bootstrap being 
awkward.

> - ideal, there would be a automatic-crushmap-expansion command which would add
>   a device to an existing crushmap (or remove one). Now, the crushmap needs to
>   be reconstructed completely, and if your numbering changes somehow, you're
>   screwed. Ideal would be "take the current crushmap and add OSD with uuid
>   xxx" - "take the current crushmap and remove OSD xxx"

You can do this now, too:

 ceph osd crush add <$osdnum> <osd.$osdnum> <weight> host=foo rack=bar [...]

The crush map has an alphanumeric name that crush ignores (at least for 
devices), although osd.$num is what we generate by default.  The 
keys/values are crush types for other levels of the hierarchy, so that you 
can specify where in the tree the new item should be placed.
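
For example, dropping a new osd.12 into a specific host and rack slot (the
names after the weight are placeholders for your own hierarchy):

 ceph osd crush add 12 osd.12 1.0 host=node3 rack=r2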


The questions for me now are what we should use for default locations and 
document as best practice.

 - do we want $cluster_uuid all over the place?
 - should we allow osds to be started by $uuid instead of rank?
 - is it sufficient for init scripts to blindly start everything in 
   /var/lib/ceph, or do we need equivalent functionality to the 'auto 
   start = false' in ceph.conf (that Wido is using)?
 - is a single init script still appropriate, or do we want something 
   better?  (I'm not very familiar with the new best practices for upstart 
   or systemd for multi-instance services like this.)
 - uuids for monitors?
 - osd uuids in osdmap?

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-20  7:25       ` Sage Weil
@ 2012-03-20  7:55         ` Bernard Grymonpon
  2012-03-27 11:29         ` David McBride
  1 sibling, 0 replies; 11+ messages in thread
From: Bernard Grymonpon @ 2012-03-20  7:55 UTC (permalink / raw)
  To: ceph-devel


On 20 Mar 2012, at 08:25, Sage Weil wrote:

> On Mon, 19 Mar 2012, Bernard Grymonpon wrote:
>> Sage Weil <sage <at> newdream.net> writes:
>> 
>>> 
>>> On Wed, 7 Mar 2012, David McBride wrote:
>>>> On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
>>>> 
>>>>> - scan the partitions for partition label with the prefix
>>>>> "ceph-osd-data-".
>>>> 
>>>> Thought: I'd consider not using a numbered partition label as the
>>>> primary identifier for an OSD.
>>>> 
>> 
>> <snip>
>> 
>>>> To make handling cases like these straightforward, I suspect Ceph may
>>>> want to use something functionally equivalent to an MD superblock --
>>>> though in practice, with an OSD, this could simply be a file containing
>>>> the appropriate meta-data.
>>>> 
>>>> In fact, I imagine that the OSDs could already contain the necessary
>>>> fields -- a reference to their parent cluster's UUID, to ensure foreign
>>>> volumes aren't mistakenly mounted; something like mdadm's event-counters
>>>> to distinguish between current/historical versions of the same OSD.
>>>> (Configuration epoch-count?); a UUID reference to that OSD's journal
>>>> file, etc.
>>> 
>>> We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
>>> gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
>>> data dir and in the journal, so you know that they go together.  
>>> 
>>> I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
>>> versioning and so forth... are you imagining a duplicate/backup instance 
>>> of an osd drive getting plugged in or something?  We don't guard for 
>>> that, but I'm not sure offhand how we would.  :/
>>> 
>>> Anyway, I suspect the missing piece here is to incorporate the uuids into 
>>> the path names somehow.  
>> 
>> I would discourage using the disk-labels, as you might not always be able to
>> set these (consider imported luns from other storage boxes, or some internal
>> regulations in labeling disks...). I would trust the sysadmin to know which
>> mounts go where to get everything in place (he himself can use the labels in
>> his fstab or some clever bootscript), and then use the ceph-metadata to start
>> only "sane" OSDs/MONs/...
> 
> The goal is to make this optional.  I.e., provide tools to use via udev 
> to mount disks in good locations based on labels, but not require them if 
> the sysadmin has some other idea about how it should be done.
> 
> Ideally, the start/stop scripts should be able to look in /var/lib/ceph 
> and start daemons for whatever it sees there that looks sane.

Start/stop scripts should not be that intelligent in my opinion - a start/stop script should just start or stop whatever it is told to start/stop (usually through a simple config file, pointing to the correct folders). If a sysadmin decides to make a backup copy of some data in /var/lib/ceph, it should not result in suddenly spawning new instances... 

Also, a start/stop script should clearly start/stop one specific osd/mon... you don't want to restart/start/stop each and every daemon every time (and an optional third parameter to an init script is uncommon).

> 
>> In my opinion, a OSD should be able to figure out himself if he has a "good"
>> dataset to "boot" with - and it is up to the mon to either reject or accept
>> this OSD as a good/valid part of the cluster, or if it needs re-syncing.
> 
> Yes.
> 
>>> TV wrote:
>>>> - FHS says human-editable configuration goes in /etc
>>>> - FHS says machine-editable state goes in /var/lib/ceph
>>>> - use /var/lib/ceph/mon/$id/ for mon.$id
>>>> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
>>>> actual location
>>>> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
>>>> actual location?
>>> 
>>> I wonder if these should be something like
>>> 
>>>  /var/lib/ceph/$cluster_uuid/mon/$id
>>>  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
>>>  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id
>> 
>> The numbering of the MON/OSD's is a bit a hassle now, best would be (in my
>> opinion)
>> 
>> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
>> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
>> /var/lib/ceph/$cluster_uuid/osd/$mon_uuid/
>> 
>> Journal and data go together for the OSD - so no need to split these on a
>> lower level. One can't have a OSD without both, so seems fair to put them next
>> to each other...
> 
> Currently the ceph-osd it told which id to be on startup; the only real 
> shift here would be to let you specify some uuids instead and have it pull 
> it's rank (id) out of the .../whoami file.
> 
> Monitors have user-friendly names ('foo', `hostname`).  We could add uuids 
> there too, but I'm less sure how useful that'll be.

Consistency would be the word you're looking for... both in ceph and in the storage field. Storage ops people are used to long random strings identifying parts (LUNs, identifiers, ...). 

Allowing the sysadmin to specify the UUIDs themselves would give the best of both worlds: lazy admins use the generated UUIDs, others generate their own (I can imagine that having the node identified by its hostname, or some other label, might be useful in a 10+ node cluster...). 

> 
>>> so that cluster instances don't stomp on one another.  OTOH, that would 
>>> imply that we should do something like
>>> 
>>> /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.
>> 
>> Ack, although at cluster creation, the cluster_uuid is unknown, which kind of
>> gives a chicken-egg situation.
> 
> Making the mkfs process take the cluster_uuid as input is easy, although 
> it makes it possible for a bad sysadmin to share a uuid across clusters.

Don't care for bad sysadmins :)

> 
> 
>> As I've been constructing some cookbooks to setup a default cluster, this is
>> what I bumped into:
>> 
>> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>>  throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>>  have a complete view of all the components of the cluster before it can
>>  determine it's own ID. A random, auto-generated UUID would be nice (I
>>  currently solved this by assigning each cluster a global "clustername", and
>>  search the chef server for all nodes, look for the highest indexed OSDs, and
>>  increment this to determine the new OSD's index - there must be a better
>>  way).
> 
> The 'ceph osd create' command will handle the allocation of a new unique 
> id for you.  We could supplement that with a uuid to make it a bit more 
> robust (if we add the osd uuids to the osdmap... probbaly a good idea 
> anyway).

For this to work, you need a connection to the monitor(s), which gives security issues and makes the creation of an OSD a two-node operation. An OSD should generate a UUID itself, and that is its one and only identifier. Once joined in a cluster for the first time, it might record the cluster uuid in its metadata. If the uuid of the osd clashes with an existing uuid, the mon should reject it.

> 
>> - the configfile needs to be the same on all hosts - which is only partially
>>  true. From my point of view, a OSD should only have some way of contacting
>>  one mon, which would inform the OSD of the cluster layout. So, only the
>>  mon-info should be there (together with the info for the OSD itself,
>>  obviously)
> 
> It doesn't, actually; it's only need to bootstrap (to find the monitor(s) 
> on startup) and to set any config values that are non-default.  The 
> current start/stop script wants to see the loca instances there, but that 
> can be replaced by looking for directories in /var/lib/ceph/.
> 
>> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>>  OSD should have permission to join the mon, for which we need to add the OSD
>>  to the mon. As chef works on the node, and can't trigger stuff on other
>>  nodes, the node that will hold the OSD needs some way of authenticating
>>  itself to the mon (I solved this by storing the "client.admin" secret on the
>>  mon-node, and then pulling this from there on the osd node, and using it to
>>  register myself to the mon. It is like putting a copy of your homekey on
>>  your front door...). I see no obvious solution here.
> 
> We've set up a special key that has permission to create new osds only, 
> but again it's pretty bad security.  Chef's model just doesn't work well 
> here.

There will always be some sort of "master key" for the cluster to create/accept new instances (either this, or no security at all). I don't see a way around it (or you will give up parts of the security model).

The more I think about it, all of the security between mons and osds is a bit strange - most of the time your storage cluster will be on an isolated, dedicated network (the private network/public network parameters do this already). Security and rights towards the client nodes are still needed, though... 

> 
>> - the current (debian) start/stop scripts are a hassle to work with, as chef
>>  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>>  mon / osd / ... should have its own start/stop script.
>> 
>> - there should be some way to ask a local running OSD/MON for its status,
>>  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>>  --uuid=xxx --type=mon status", which would inform us if it is running,
>>  healthy, part of the cluster, lost in space...
> 
> Each daemon has a socket in /var/run/ceph to communicate with it; adding a 
> health command would be pretty straightforward.
> 
>> - growing the cluster bit by bit would be ideal, this is how chef works (it
>>  handles node per node, not a bunch of nodes in one go) 
> 
> This works now, with the exception of monitor cluster bootstrap being 
> awkward.

How is the initial number of PGs determined? If you start with no OSDs and add them, do the PGs grow?

> 
>> - ideal, there would be a automatic-crushmap-expansion command which would add
>>  a device to an existing crushmap (or remove one). Now, the crushmap needs to
>>  be reconstructed completely, and if your numbering changes somehow, you're
>>  screwed. Ideal would be "take the current crushmap and add OSD with uuid
>>  xxx" - "take the current crushmap and remove OSD xxx"
> 
> You can do this now, too:
> 
> ceph osd crush add <$osdnum> <osd.$osdnum> <weight> host=foo rack=bar [...]
> 
> The crush map has a alphanumeric name that crush ignores (at least for 
> devices), although osd.$num is what we generate by default.  The 
> keys/values are crush types for other levels of the hierarchy, so that you 
> can specify where in the tree the new item should be placed.

Nice, I'll have a look at this later this week.

> 
> The questions for me now are what we should use for default locations and 
> document as best practice.
> 
> - do we want $cluster_uuid all over the place?

In my opinion - no. I don't see a single machine serving two clusters at once; only in very special test cases might that happen. If an OSD knows which cluster it belongs to (and records this in its metadata), that would be fine.

> - should we allow osds to be started by $uuid instead of rank?

Yes, please. Numbering things is a pain if you don't have/control all the nodes at once. 

> - is it sufficient for init scripts to blindly start everything in 
>   /var/lib/ceph, or do we need equivalent functionality to the 'auto 
>   start = false' in ceph.conf (that Wido is using)?
> - is a single init script still appropriate, or do we want something 
>   better?  (I'm not very familiar with the new best practices for upstart 
>   or systemd for multi-instance services like this.)

Start/stop scripts should be stupid in my opinion - see above. 

> - uuids for monitors?

yes

> - osd uuids in osdmap?

Yes, lose the "rank" completely if possible.

Rgds,
Bernard

> 
> sage
> 



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-20  7:25       ` Sage Weil
  2012-03-20  7:55         ` Bernard Grymonpon
@ 2012-03-27 11:29         ` David McBride
  2012-03-27 19:40           ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: David McBride @ 2012-03-27 11:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development Mailing List

On Tue, 2012-03-20 at 00:25 -0700, Sage Weil wrote:

> Currently the ceph-osd it told which id to be on startup; the only real 
> shift here would be to let you specify some uuids instead and have it pull 
> it's rank (id) out of the .../whoami file.

I'm increasingly coming to believe that an OSD's rank should not be
exposed to the admin/user.  While it's clearly important as an internal
implementation detail, I can't currently see a reason why the admin
needs to know an OSD's rank, or why it can't (in principle) be
dynamically managed on the administrator's behalf.

A non-trivial part of the complexity of (re-)configuring a running
cluster as nodes are added and removed is the correct numbering of OSDs.

At the moment I'm still experimenting -- I don't know what's supposed to
happen when low-numbered OSDs are removed; do all the existing ones
renumber?  Or do you get fragmentation in the OSD number space?

If it's the former, then the rank of an OSD is metadata that can change
during its lifetime -- meaning that it's probably not a good idea to use
it in path-names, for example.

I suspect that using UUIDs and/or human-readable labels to refer to OSDs
is probably going to be superior to using the OSDs' rank.

> We've set up a special key that has permission to create new osds only, 
> but again it's pretty bad security.  Chef's model just doesn't work well 
> here.

If I understand the model correctly:

 - Each individual daemon generates its own keys.  This is a secret key 
   in the symmetric-cryptography sense.
 - Each daemon needs to have its keys registered with the MON cluster.
   (The MON operates as a centrally-trusted key distribution centre.)
 - To register keys with the MON cluster, a cluster administrative 
   privilege is required.
 - The registration process also updates the effective access-control 
   privilege set associated with that key.

Two thoughts:

 1. I suspect that some small changes to the bootstrap procedure could 
    result in a more streamlined process:

     - OSDs stop being responsible for generating their own keys.  
       Their keys are instead generated by a MON node and are stored in 
       the MON cluster.  As a result, the problem to be solved changes: 

         * before, an OSD needed to have the necessary privileges to
           update the cluster-wide configuration; 
         * now, the MON node only needs to have the necessary privileges
           to install an OSD's secret key on that OSD's host.

     - It should then be straightforward to set up a privileged, 
       automated process -- probably on a MON node -- to stage a copy
       of an OSD's secret key to the appropriate location on that OSD's 
       host.

       (This assumes that an OSD's host can be automatically determined 
       and authenticated using some existing means (SSH keys, Kerberos, 
       etc.) -- which I'd expect to be the case for any non-trivial 
       installation.)

     - This automated process could be triggered by the OSD   
       installation process -- either by an out-of-band script, or 
       conceivably by a key-solicitation message sent in-band by the 
       OSD itself.

 2. This sounds very similar to the model for Kerberos 5, as used in 
    the MIT Kerberos implementation and Microsoft's Active Directory.  
    It might be an interesting (future!) project to see how difficult 
    it would be to modify the Ceph daemons and protocols to use 
    Kerberos-based authentication as an alternative to cephx, possibly 
    via the GSS-API.

Aha, I note that this is probably not a surprise -- the 0.18 
release-notes, for example, note the similarity between cephx and
Kerberos.  I presume the goal in rolling your own was to avoid adding
barriers to deployment?

>  - is a single init script still appropriate, or do we want something 
>    better?  (I'm not very familiar with the new best practices for upstart 
>    or systemd for multi-instance services like this.)

At risk of duplicating part of the functionality of Upstart or systemd,
the traditional solution for this kind of problem is for the
multi-process tool to implement its own master-control process, and have
init start that.  Responsibility for (re-)starting sub-processes is then
delegated to this master daemon.  

As well as allowing sophisticated tool-specific logic for determining
what processes should/should not be started, this can be a useful point
of control for enforcing privilege separation and resource-limits
between sub-processes.

This approach can be powerful, but does require the overhead of
implementing and managing all of the necessary logic yourself.  I can't
comment on whether tools like Upstart, systemd, or something else could
easily be used to avoid incurring this additional cost.  This may be
something worth discussing with upstream developers..?

Cheers,
David
-- 
David McBride <dwm@doc.ic.ac.uk>
Department of Computing, Imperial College, London



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-19  9:21     ` Bernard Grymonpon
  2012-03-20  7:25       ` Sage Weil
@ 2012-03-27 18:21       ` Tommi Virtanen
  1 sibling, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-27 18:21 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Mon, Mar 19, 2012 at 02:21, Bernard Grymonpon <bernard@openminds.be> wrote:
> As I've been constructing some cookbooks to setup a default cluster, this is
> what I bumped into:
>
> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>  throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>  have a complete view of all the components of the cluster before it can
>  determine it's own ID. A random, auto-generated UUID would be nice (I
>  currently solved this by assigning each cluster a global "clustername", and
>  search the chef server for all nodes, look for the highest indexed OSDs, and
>  increment this to determine the new OSD's index - there must be a better
>  way).

That's why you ask the monitors to assign you one:
https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/bootstrap_osd.rb#L47

As far as I know, chef does NOT provide the necessary atomicity to
reliably allocate unique ids.

> - the configfile needs to be the same on all hosts - which is only partially
>  true. From my point of view, a OSD should only have some way of contacting
>  one mon, which would inform the OSD of the cluster layout. So, only the
>  mon-info should be there (together with the info for the OSD itself,
>  obviously)

Can't rely on a single mon, that'd be a single point of failure.

The only thing the config really needs is the monitor locations. I
expect the rest of this to slowly go away, as we improve the defaults
in the code and the cookbook:

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default/ceph.conf.erb
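
A minimal sketch of such a config, carrying little more than the monitor
locations (host names, addresses and the auth line are placeholders; the
cookbook template above is the authoritative version):

  [global]
          auth supported = cephx

  [mon.a]
          host = mon-a
          mon addr = 192.0.2.10:6789

  [mon.b]
          host = mon-b
          mon addr = 192.0.2.11:6789

  [mon.c]
          host = mon-c
          mon addr = 192.0.2.12:6789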

> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>  OSD should have permission to join the mon, for which we need to add the OSD
>  to the mon. As chef works on the node, and can't trigger stuff on other
>  nodes, the node that will hold the OSD needs some way of authenticating
>  itself to the mon (I solved this by storing the "client.admin" secret on the
>  mon-node, and then pulling this from there on the osd node, and using it to
>  register myself to the mon. It is like putting a copy of your homekey on
>  your front door...). I see no obvious solution here.

I use a "bootstrap-osd" key that can create new OSDs and authorize
keys for them. It's less powerful than client.admin.

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/single_mon.rb#L49

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/bootstrap_osd.rb#L20

> - the current (debian) start/stop scripts are a hassle to work with, as chef
>  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>  mon / osd / ... should have its own start/stop script.

The cookbook uses upstart jobs, and runs an instance per osd id etc.

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default/upstart-ceph-osd.conf.erb
https://github.com/ceph/ceph-cookbooks/tree/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default
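
With instance jobs, a single daemon can then be handled on its own, e.g.
(the job and instance variable names depend on the cookbook's templates, so
treat these as placeholders):

  start ceph-osd id=0
  stop ceph-osd id=0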

> - there should be some way to ask a local running OSD/MON for its status,
>  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>  --uuid=xxx --type=mon status", which would inform us if it is running,
>  healthy, part of the cluster, lost in space...

That'd be the "admin socket". It's unfortunately not well documented currently.

> - growing the cluster bit by bit would be ideal; this is how chef works (it
>  handles nodes one by one, not a bunch of nodes in one go)

The cookbook handles this, with some limitations that will be removed
once we have resources to work on it again.

> - ideally, there would be an automatic crushmap-expansion command which would
>  add a device to an existing crushmap (or remove one). Now the crushmap needs
>  to be reconstructed completely, and if your numbering changes somehow, you're
>  screwed. Ideal would be "take the current crushmap and add OSD with uuid
>  xxx" - "take the current crushmap and remove OSD xxx"

Exists already.
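
Concretely, something along these lines (a sketch; the bucket hierarchy is an
example and the exact verbs have shifted between releases):

  # Add or reposition a single device in the live crushmap...
  ceph osd crush set 12 osd.12 1.0 pool=default host=node12
  # ...and take it out again when the osd goes away.
  ceph osd crush remove osd.12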

> Just my thoughts! I've been following the ceph project for a while now, set up
>  a couple of test clusters in the past and over the last two weeks, and made
>  the cookbooks to make my life easier (and bumped into a lot of ops trouble
>  doing this...).

To summarize:

Status of the cookbook at https://github.com/ceph/ceph-cookbooks is:

- it assumes you only run a single monitor
- it assumes you run 1 osd per node, as a subdirectory of /srv

Both of these restrictions will eventually be lifted; that was just to
get started.

Right now, I know we have one admin looking to lift the "1 osd per
node" limitation (he's ok doing mons manually), but other than that
I'm the only person who has put time into the cookbooks, and I'm
currently busy setting up our automated test infrastructure. We're
hiring Chef devopsy people, come help us!

* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-27 11:29         ` David McBride
@ 2012-03-27 19:40           ` Sage Weil
  2012-03-27 20:16             ` Tommi Virtanen
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-03-27 19:40 UTC (permalink / raw)
  To: David McBride; +Cc: Ceph Development Mailing List

On Tue, 27 Mar 2012, David McBride wrote:
> On Tue, 2012-03-20 at 00:25 -0700, Sage Weil wrote:
> 
> > Currently the ceph-osd is told which id to be on startup; the only real
> > shift here would be to let you specify some uuids instead and have it pull
> > its rank (id) out of the .../whoami file.
> 
> I'm increasingly coming to believe that an OSD's rank should not be
> exposed to the admin/user.  While it's clearly important as an internal
> implementation detail, I can't currently see a reason why the admin
> needs to know an OSD's rank, or why it can't (in principle) be
> dynamically managed on the administrator's behalf.
> 
> A non-trivial part of the complexity of (re-)configuring a running
> cluster as nodes are added and removed is the correct numbering of OSDs.
> 
> At the moment I'm still experimenting -- I don't know what's supposed to
> happen when low-numbered OSDs are removed; do all the existing ones
> renumber?  Or do you get fragmentation in the OSD number space?
> 
> If it's the former, then the rank of an OSD is metadata that can change
> during its lifetime -- meaning that it's probably not a good idea to use
> it in path-names, for example.
> 
> I suspect that using UUIDs and/or human-readable labels to refer to OSDs
> is probably going to be superior to using the OSDs' rank.

The ranks don't change, so at least that part is not a problem.  If you 
remove old osds, there's a gap in the id space, but that's not a problem.

I don't think they can be hidden entirely because they are tied to the 
CRUSH map, which can be (and often must be) manipulated directly.
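
"Directly" here usually means the decompile/edit/recompile cycle, where the
ids are visible as device entries; a sketch:

  ceph osd getcrushmap -o crushmap
  crushtool -d crushmap -o crushmap.txt   # ids appear as "device N osd.N"
  # ... edit crushmap.txt as needed ...
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new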

> > We've set up a special key that has permission to create new osds only, 
> > but again it's pretty bad security.  Chef's model just doesn't work well 
> > here.
> 
> If I understand the model correctly:
> 
>  - Each individual daemon generates its own keys.  This is a secret key 
>    in the symmetric-cryptography sense.
>  - Each daemon needs to have its keys registered with the MON cluster.
>    (The MON operates as a centrally-trusted key distribution centre.)
>  - To register keys with the MON cluster, a cluster administrative 
>    privilege is required.
>  - The registration process also updates the effective access-control 
>    privilege set associated with that key.

Alternatively, as TV noted, you can use a 'provisioning' key whose only
ability is to add new keys with specific privs.

> Two thoughts:
> 
>  1. I suspect that some small changes to the bootstrap procedure could 
>     result in a more streamlined process:
> 
>      - OSDs stop being responsible for generating their own keys.  
>        Their keys are instead generated by a MON node and are stored in 
>        the MON cluster.  As a result, the problem to be solved changes: 
> 
>          * before, an OSD needed to have the necessary privileges to
>            update the cluster-wide configuration; 
>          * now, the MON node only needs to have the necessary privileges
>            to install an OSD's secret key on that OSD's host.
> 
>      - It should then be straightforward to set up a privileged, 
>        automated process -- probably on a MON node -- to stage a copy
>        of an OSD's secret key to the appropriate location on that OSD's
>        host.
> 
>        (This assumes that an OSD's host can be automatically determined
>        and authenticated using some existing means (SSH keys, Kerberos, 
>        etc.) -- which I'd expect to be the case for any non-trivial 
>        installation.)
> 
>      - This automated process could be triggered by the OSD   
>        installation process -- either by an out-of-band script, or 
>        conceivably by a key-solicitation message sent in-band by the 
>        OSD itself.

I don't have much of an opinion on the strategy in general (TV probably 
does), but this is already possible today.  If you pass --mkkey along 
with --mkfs to ceph-osd it will generate a key; if you don't, you can 
copy one into place yourself.
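
A sketch of both variants (the id and paths are examples; by default the key
lands in the "keyring" file inside the osd data directory):

  # Variant 1: let ceph-osd generate its own key at mkfs time.
  ceph-osd -i 12 --mkfs --mkkey

  # Variant 2: mkfs without a key, then copy one generated elsewhere
  # (e.g. on a mon host) into place.
  ceph-osd -i 12 --mkfs
  scp mon-host:/some/staging/path/osd.12.keyring \
      /var/lib/ceph/osd-data/12/keyring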

>  2. This sounds very similar to the model for Kerberos 5, as used in 
>     the MIT Kerberos implementation and Microsoft's Active Directory.  
>     It might be an interesting (future!) project to see how difficult 
>     it would be to modify the Ceph daemons and protocols to use 
>     Kerberos-based authentication as an alternative to cephx, possibly 
>     via the GSS-API.
> 
> Aha, I note that this is probably not a surprise -- the 0.18 
> release-notes, for example, note the similarity between cephx and
> Kerberos.  I presume the goal in rolling your own was to avoid adding
> barriers to deployment?

That, and we were concerned about scalability issues with Kerberos 
itself... it didn't map cleanly onto the distributed nature of Ceph.

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-27 19:40           ` Sage Weil
@ 2012-03-27 20:16             ` Tommi Virtanen
  0 siblings, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-27 20:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: David McBride, Ceph Development Mailing List

On Tue, Mar 27, 2012 at 12:40, Sage Weil <sage@newdream.net> wrote:
>> Two thoughts:
>>
>>  1. I suspect that some small changes to the bootstrap procedure could
>>     result in a more streamlined process:
>>
>>      - OSDs stop being responsible for generating their own keys.
>>        Their keys are instead generated by a MON node and are stored in
>>        the MON cluster.  As a result, the problem to be solved changes:
>>
>>          * before, an OSD needed to have the necessary privileges to
>>            update the cluster-wide configuration;
>>          * now, the MON node only needs to have the necessary privileges
>>            to install an OSD's secret key on that OSD's host.

Perhaps, but the devil is in the details. In general, there is no
single mon node; any assumption of one is unwanted. With Chef or
similar automation software, the monitors do not know when a new OSD
is being set up; they cannot easily initiate actions at that time.
Triggering actions from the storage node is just easier.

The bootstrap key mechanism is not perfect, but it plays well with the
limitations of Chef, and it works. Once the rest reaches a similar
level of functionality, I'll happily revisit it, but for now it's
plenty fine.

Thread overview: 11+ messages
2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
2012-03-06 22:29 ` Greg Farnum
2012-03-07  9:55 ` David McBride
2012-03-07 20:54   ` Sage Weil
2012-03-19  9:21     ` Bernard Grymonpon
2012-03-20  7:25       ` Sage Weil
2012-03-20  7:55         ` Bernard Grymonpon
2012-03-27 11:29         ` David McBride
2012-03-27 19:40           ` Sage Weil
2012-03-27 20:16             ` Tommi Virtanen
2012-03-27 18:21       ` Tommi Virtanen
