* defaults paths #2
@ 2012-04-05 23:36 Tommi Virtanen
  2012-04-06  5:12 ` Sage Weil
  2012-04-06  7:37 ` Bernard Grymonpon
  0 siblings, 2 replies; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-05 23:36 UTC (permalink / raw)
  To: ceph-devel

[Apologies for starting a new thread, vger unsubscribed me without
warning, I'm reading the previous thread via web]

In response to the thread at http://marc.info/?t=133360781700001&r=1&w=2


Sage:
> The locations could be:
>  keyring:
>   /etc/ceph/$cluster.keyring  (fallback to /etc/ceph/keyring)

I think all the osds and mons will have their secrets inside their
data dir. This, if used, will be just for the command line tools.

>  osd_data, mon_data:
>   /var/lib/ceph/$cluster.$name
>   /var/lib/ceph/$cluster/$name
>   /var/lib/ceph/data/$cluster.$name
>   /var/lib/ceph/$type-data/$cluster-$id

I'm thinking.. /var/lib/ceph/$type/$cluster-$id, where $type is osd or mon.
And this is what is in the wip-defaults branch now, it seems.
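
For illustration, with the default cluster name "ceph", that would give
directories like (the ids here are just examples):

    /var/lib/ceph/osd/ceph-0
    /var/lib/ceph/osd/ceph-1
    /var/lib/ceph/mon/ceph-a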



Bernard Grymonpon <bernard@openminds.be>:
> As a osd consists of data and the journal, it should stay together, with all info for \
> that one osd in one place:
>
> I would suggest
>
> /var/lib/ceph/osd/$id/data
> and
> /var/lib/ceph/osd/$id/journal

The journal can live inside .../data (when just a file on the same
spindle is OK); we don't need a directory tree level just for that.

> ($id could be replaced by $uuid or $name, for which I would prefer $uuid)

In some ways $uuid is cleaner, but this is something that is too
visible for admins, and the $id space still exists and cannot tolerate
collisions, so we might as well use those.



Andrey Korolyov <andrey@xdel.ru>:
> Right, but probably we need journal separation at the directory level
> by default, because there is a very small amount of cases when speed
> of main storage is sufficient for journal or when resulting speed
> decrease is not significant, so journal by default may go into
> /var/lib/ceph/osd/journals/$i/journal where osd/journals mounted on
> the fast disk.

Keeping journals as files in a single, separate, dedicated filesystem
(most likely on an SSD) is on my list of use cases to be supported. I
don't think we need to always use /var/lib/ceph/osd/journals just to
support that; I think I can arrange the support without that
clumsiness. Details are still pending.

I think many people with that hardware setup will choose to just GPT
partition the SSD, and have a lot less code in the way of the IO.



Bernard Grymonpon <bernard@openminds.be>:
> I feel it's up to the sysadmin to mount / symlink the correct storage devices on the \
> correct paths - ceph should not be concerned that some volumes might need to sit \
> together.

I think we agree on intents, but I disagree strongly with the words
you chose. The sysadmin should not need to symlink anything. We will
bring up a whole cluster, and enable you to manage that easily.
Anything less is a failure. You may choose to opt out of some of that
higher smarts, if you want to do it differently, but we will need to
provide that. Managing a big cluster without automation is painful,
and people shouldn't need to reinvent the wheel.

Now, whether that automation uses symlinks to point things to the
right places or not, that's almost an implementation detail.



Bernard Grymonpon <bernard@openminds.be>:
> I assume most OSD nodes will normally run a single OSD, so this would not apply to \
> most nodes.

Expected use right now is 1 OSD per hard drive, 8-12 hard drives per
server. That's what we'll be benchmarking on, primarily.



Wido den Hollander <wido@widodh.nl>:
> I think that's a wrong assumption. On most systems I think multiple OSDs
> will exist, it's debatable if one would run OSDs from different clusters
> very often.

I expect mixing clusters to be rare, but that use case has been made
strongly enough that it seems we will support it as a first-class
feature.

> I'm currently using: osd data = /var/lib/ceph/$name
>
> To get back to what sage mentioned, why add the "-data" suffix to a
> directory name? Isn't it obvious that a directory will contain data?

He was separating osd-data from osd-journal. Since then we've
simplified it to /var/lib/ceph/$type/$cluster-$id, which is close to
what you have, but 1) separates the osd data dirs from the rest of
/var/lib/ceph, for future expansion room, and 2) adds the cluster name.

> As I think it is a very specific scenario where a machine would be
> participating in multiple Ceph clusters I'd vote for:
>
>   /var/lib/ceph/$type/$id

I really want to avoid having two different cases, two different code
paths to test, a more rare variant that can break without being
noticed. I want everything to always look the same. "ceph-" seems a
small enough price to pay for that.



Bernard Grymonpon <bernard@openminds.be>:
> If it is recommended setup to have multiple OSDs per node (like, one OSD per physical \
> drive), then we need to take that in account - but don't assume that one node only \
> has one SSD disk for journals, which would be shared between all OSDs...

As I tried to explain in an earlier "braindump" email, we'll support journals

1. inside osd data
2. on shared ssd as file
3. separate block dev (e.g. ssd or raid 2nd lun with different config)

and find the right journal automagically, by matching uuids. This is
what I'm working on right now.
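
As a very rough sketch of how the uuid matching could work (the
journal_uuid file name and the lookup commands below are assumptions
for illustration only, not the final mechanism):

    #!/bin/sh
    # Hypothetical: find the journal for osd.0 via the uuid recorded
    # in its data directory.
    want=$(cat /var/lib/ceph/osd/ceph-0/journal_uuid)
    # Case 3: a dedicated block device (e.g. a GPT partition) carrying that uuid.
    dev=$(blkid -o device -t PARTUUID="$want" 2>/dev/null)
    # Case 2: fall back to a journal file named after the uuid on a
    # shared SSD filesystem.
    [ -n "$dev" ] || dev="/var/lib/ceph/journal/$want"
    echo "osd.0 journal: $dev"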



Bernard Grymonpon <bernard@openminds.be>:
> I would suggest you fail the startup of the daemon, as it doesn't have all the needed \
> parts - I personally don't like these "autodiscover" thingies, you never know why \
> they are waiting/searching for,...

If you don't like hotplug, you can disable it.

Having hundreds of machines with ~10 disks each and the data center
being remote and managed by people you've never seen will probably
make you like the automation more. Having failed disks turn on an LED,
ops swapping in a new drive from a dedicated pile of pre-prepared
spares, and things just.. working.. is the goal.

> Say that we duplicate a node, for some testing/failover/... I would not
> want to daemon to automatically start, just because the data is there...

If you do that, just turn hotplug off before you plug the disks into
the replica. There's not much you can do with it, though -- starting
the osds is a no-no in this case.

I would argue that random copying of osd data disks or journals is an
accident waiting to happen. But there's nothing inherent in the design
to say you can't do that. Just don't do hotplug.

We can't rely on hotplug working anyway. There will *always* be a
chance for the admin to manually say "hey, here's a new block device
for you, see if there's something to run there".


* Re: defaults paths #2
  2012-04-05 23:36 defaults paths #2 Tommi Virtanen
@ 2012-04-06  5:12 ` Sage Weil
  2012-04-06 17:34   ` Tommi Virtanen
  2012-04-06  7:37 ` Bernard Grymonpon
  1 sibling, 1 reply; 14+ messages in thread
From: Sage Weil @ 2012-04-06  5:12 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Thu, 5 Apr 2012, Tommi Virtanen wrote:
> > As I think it is a very specific scenario where a machine would be
> > participating in multiple Ceph clusters I'd vote for:
> >
> >   /var/lib/ceph/$type/$id
> 
> I really want to avoid having two different cases, two different code
> paths to test, a more rare variant that can break without being
> noticed. I want everything to always look the same. "ceph-" seems a
> small enough price to pay for that.

Here's what I'm thinking:

 - No data paths are hard-coded except for /etc/ceph/*.conf

 - We initially mount osd volumes in some temporary location (say, 
   /var/lib/ceph/temp/$uuid)

 - We identify the osd id, cluster uuid, etc., and determine where to mount 
   it with

	ceph-osd --cluster $cluster -i $id --show-config-value osd_data

   This will normally give you the default, unless the conf file specified 
   something else.

 - Normal people get a default of /var/lib/ceph/$type/$id

 - Multicluster crazies put

	[global]
		osd data = /var/lib/ceph/$type/$cluster-$id
		osd journal = /var/lib/ceph/$type/$cluster-$id/journal
		mon data = /var/lib/ceph/$type/$cluster-$id

   (or whatever) in /etc/ceph/$cluster.conf and get something else.

Code paths are identical, data flow is identical.  We get a simple general 
case, without closing the door on multicluster configurations, which vary 
only by the config value that is easily adjusted on a per-cluster basis...
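
To illustrate the --show-config-value step, the query would look
something like this (the output shown is only what the defaults above
would imply):

    $ ceph-osd --cluster ceph -i 12 --show-config-value osd_data
    /var/lib/ceph/osd/12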

sage


* Re: defaults paths #2
  2012-04-05 23:36 defaults paths #2 Tommi Virtanen
  2012-04-06  5:12 ` Sage Weil
@ 2012-04-06  7:37 ` Bernard Grymonpon
  2012-04-06 17:57   ` Tommi Virtanen
  1 sibling, 1 reply; 14+ messages in thread
From: Bernard Grymonpon @ 2012-04-06  7:37 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel


On 06 Apr 2012, at 01:36, Tommi Virtanen wrote:

> [Apologies for starting a new thread, vger unsubscribed me without
> warning, I'm reading the previous thread via web]
> 
> In response to the thread at http://marc.info/?t=133360781700001&r=1&w=2
> 
> 
> Sage:
>> The locations could be:
>> keyring:
>>  /etc/ceph/$cluster.keyring  (fallback to /etc/ceph/keyring)
> 
> I think all the osds and mons will have their secrets inside their
> data dir. This, if used, will be just for the command line tools.
> 
>> osd_data, mon_data:
>>  /var/lib/ceph/$cluster.$name
>>  /var/lib/ceph/$cluster/$name
>>  /var/lib/ceph/data/$cluster.$name
>>  /var/lib/ceph/$type-data/$cluster-$id
> 
> I'm thinking.. /var/lib/ceph/$type/$cluster-$id, where $type is osd or mon.
> And this is what is in the wip-defaults branch now, it seems.

That seems a good solution. 

> 
> 
> 
> Bernard Grymonpon <bernard@openminds.be>:
>> As a osd consists of data and the journal, it should stay together, with all info for \
>> that one osd in one place:
>> 
>> I would suggest
>> 
>> /var/lib/ceph/osd/$id/data
>> and
>> /var/lib/ceph/osd/$id/journal
> 
> Journal can live inside .../data (when just a file on the same spindle
> is ok), we don't need to use a directory tree level just for that.
> 
>> ($id could be replaced by $uuid or $name, for which I would prefer $uuid)
> 
> In some ways $uuid is cleaner, but this is something that is too
> visible for admins, and the $id space still exists and cannot tolerate
> collisions, so we might as well use those.

Storage lives by UUIDs. I would suggest moving the code in the direction of getting rid of all its own naming and labeling, and just sticking to UUIDs. I could not care less whether the data on that disk is internally named "23" or "678"; it is just a part of my cluster, and it is up to ceph to figure out which part of the puzzle it holds.

(So, the above would then change to /var/lib/ceph/$type/$cluster/$uuid in my ideal scenario, or some variant on it) 

> 
> 
> Andrey Korolyov <andrey@xdel.ru>:
>> Right, but probably we need journal separation at the directory level
>> by default, because there is a very small amount of cases when speed
>> of main storage is sufficient for journal or when resulting speed
>> decrease is not significant, so journal by default may go into
>> /var/lib/ceph/osd/journals/$i/journal where osd/journals mounted on
>> the fast disk.
> 
> Journals as files in a single, separate, dedicated filesystem (most
> likely on SSD) is on my list of use cases to be supported. I don't
> think we need to always use /var/lib/ceph/osd/journals just to support
> that; I think I can arrange the support without that clumsiness.
> Details are still pending.
> 
> I think many people with that hardware setup will choose to just GPT
> partition the SSD, and have a lot less code in the way of the IO.

A single file, sitting on some partition somewhere, will be very hard to auto-find, unless you keep some info in the OSD that tracks the UUID of the disk plus the path inside the filesystem where it previously was. 

> 
> 
> Bernard Grymonpon <bernard@openminds.be>:
>> I feel it's up to the sysadmin to mount / symlink the correct storage devices on the \
>> correct paths - ceph should not be concerned that some volumes might need to sit \
>> together.
> 
> I think we agree on intents, but I disagree strongly with the words
> you chose. The sysadmin should not need to symlink anything. We will
> bring up a whole cluster, and enable you to manage that easily.
> Anything less is a failure. You may choose to opt out of some of that
> higher smarts, if you want to do it differently, but we will need to
> provide that. Managing a big cluster without automation is painful,
> and people shouldn't need to reinvent the wheel.

Actually, it will bring up one node of a cluster, regardless of the state of the other parts of the cluster. The question is how automatically everything should be detected and started.

Hotplugging and auto-mounting are already there - it is up to the sysadmin to use them if needed.

> 
> Now, whether that automation uses symlinks to point things to the
> right places or not, that's almost an implementation detail.
> 
> 
> 
> Bernard Grymonpon <bernard@openminds.be>:
>> I assume most OSD nodes will normally run a single OSD, so this would not apply to \
>> most nodes.
> 
> Expected use right now is 1 OSD per hard drive, 8-12 hard drives per
> server. That's what we'll be benchmarking on, primarily.
> 
> 
> 
> Wido den Hollander <wido@widodh.nl>:
>> I think that's a wrong assumption. On most systems I think multiple OSDs
>> will exist, it's debatable if one would run OSDs from different clusters
>> very often.
> 
> I expect mixing clusters to be rare, but that use case has been made
> strongly enough that it seems we will support it as a first-class
> feature.
> 
>> I'm currently using: osd data = /var/lib/ceph/$name
>> 
>> To get back to what sage mentioned, why add the "-data" suffix to a
>> directory name? Isn't it obvious that a directory will contain data?
> 
> He was separating osd-data from osd-journal. Since that we've
> simplified that to /var/lib/ceph/$type/$cluster-$id, which is close to
> what you have, but 1) separating osd data dirs from rest of
> /var/lib/ceph, for future expansion room 2) adding the cluster name.
> 
>> As I think it is a very specific scenario where a machine would be
>> participating in multiple Ceph clusters I'd vote for:
>> 
>>  /var/lib/ceph/$type/$id
> 
> I really want to avoid having two different cases, two different code
> paths to test, a more rare variant that can break without being
> noticed. I want everything to always look the same. "ceph-" seems a
> small enough price to pay for that.
> 
> 
> 
> Bernard Grymonpon <bernard@openminds.be>:
>> If it is recommended setup to have multiple OSDs per node (like, one OSD per physical \
>> drive), then we need to take that in account - but don't assume that one node only \
>> has one SSD disk for journals, which would be shared between all OSDs...
> 
> As I tried to explain in an earlier "braindump" email, we'll support journals
> 
> 1. inside osd data
> 2. on shared ssd as file
> 3. separate block dev (e.g. ssd or raid 2nd lun with different config)
> 
> and find the right journal automagically, by matching uuids. This is
> what I'm working on right now.

See above - finding the journal as a file on a filesystem somewhere might make it hard to auto-detect that specific file.

> 
> 
> 
> Bernard Grymonpon <bernard@openminds.be>:
>> I would suggest you fail the startup of the daemon, as it doesn't have all the needed \
>> parts - I personally don't like these "autodiscover" thingies, you never know why \
>> they are waiting/searching for,...
> 
> If you don't like hotplug, you can disable it.

Same goes the other way - if someone likes to start everything automagically, there is nothing that stops them from using the normal hotplug tools available: just write some rules to act on the presence of certain disk labels, as in the sketch below.
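
For example, a hotplug rule along these lines could do it (the label
convention and the helper script name here are made up, just to show
the idea):

    # /etc/udev/rules.d/90-ceph-osd.rules  (illustrative only)
    SUBSYSTEM=="block", ENV{ID_FS_LABEL}=="ceph.*", RUN+="/usr/local/sbin/ceph-hotplug-osd %k"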

Ceph can provide scripts to do this, but I would keep this logic out of the default behavior of ceph-osd. In my opinion, this is a storage daemon - not a storage-node-management-daemon.

If you want to keep this outside ceph-osd, and provide this as some helper to the /etc/init.d scripts, then I'm all the way with you, as long as the magic .

In my opinion the config file for ceph should contain only UUIDs for the devices which hold the data, like:

...
[osd]
  uuid=abc-123-123-123
  uuid=def-456-456-456
  uuid=ghi-789-789-789
...

The init.d script helper scans this file, mounts each device, does a consistency check of the data on it (and finds its auth info, internal id, etc...), and starts an osd for this uuid (how the journal is found is still a mystery ;-)).
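
A minimal sketch of what such a helper could do, assuming the uuid=
lines above and mount-by-uuid (everything here is illustrative, not
existing ceph tooling):

    #!/bin/sh
    # For each uuid listed in the config, mount the device and start its osd.
    for uuid in $(sed -n 's/^[[:space:]]*uuid=//p' /etc/ceph/ceph.conf); do
        dir="/var/lib/ceph/osd/$uuid"
        mkdir -p "$dir"
        mount "/dev/disk/by-uuid/$uuid" "$dir"
        id=$(cat "$dir/whoami")   # internal osd id recorded on the disk
        ceph-osd -i "$id"
    done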

If there happens to be another disk with valid ceph data on it that should be left alone, simply not listing its uuid in the config file would prevent it from being started on a normal boot (for all we know, it belongs to a different cluster, in another network/setup/...). If you plug in a disk with an old dataset on it, you would not want it to start...

> 
> Having hundreds of machines with ~10 disks each and the data center
> being remote and managed by people you've never seen will probably
> make you like the automation more. Having failed disks turn on an LED,
> ops swapping in a new drive from a dedicated pile of pre-prepared
> spares, and things just.. working.. is the goal.

I would like to know when something happens, and that it happens by my rules and wishes, not by what Ceph thinks it should do. (And yes, my colleagues and I manage multiple machines, sitting in a datacenter far, far away.) If some part of a setup fails and needs replacing/maintenance, it is done on my terms, not by random magic.

In my ideal world, I would want my chef (chef, not ceph!) server to have knowledge of all the UUIDs that (might) be used in a certain ceph cluster, and then apply the correct roles to the nodes holding these disks. If the disk is empty, it would be formatted/initialized and added to the cluster. If some other disk is popped in, it should be left alone. On the node itself, that would result in a config file with all the uuids (and some info about the mons).

If someone removes or pops in a disk, what should happen then is up to me. Whether it starts everything automatically, or whether a certain node isn't set to auto-boot, is up to me (and should result in changes in the config on my chef server, which controls all my nodes). If I want all these UUIDs to auto-start an OSD when inserted, then I would let chef write a udev rules file to do exactly that.

Rgds,
Bernard

> 
>> Say that we duplicate a node, for some testing/failover/... I would not
>> want to daemon to automatically start, just because the data is there...
> 
> If you do that, just turn hotplug off before you plug the disks in to
> the replica. There's not much you can do with it, though -- starting
> the osds is a no-no, in this case.
> 
> I would argue that random copying of osd data disks or journals is an
> accident waiting to happen. But there's nothing inherent in the design
> to say you can't do that. Just don't do hotplug.
> 
> We can't rely on hotplug working anyway. There will *always* be a
> change for the admin to manually say "hey, here's a new block device
> for you, see if there's something to run there".
> 



* Re: defaults paths #2
  2012-04-06  5:12 ` Sage Weil
@ 2012-04-06 17:34   ` Tommi Virtanen
  2012-04-06 17:55     ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-06 17:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, Apr 5, 2012 at 22:12, Sage Weil <sage@newdream.net> wrote:
> Here's what I'm thinking:
>
>  - No data paths are hard-coded except for /etc/ceph/*.conf
>  - We initially mount osd volumes in some temporary location (say,
>   /var/lib/ceph/temp/$uuid)
>  - We identify the oid, cluster uuid, etc., and determine where to mount
>   it with
>
>        ceph-osd --cluster $cluster -i $id --show-config-value osd_data
>
>   This will normally give you the default, unless the conf file specified
>   something else.
>  - Normal people get a default of /var/lib/ceph/$type/$id
>  - Multicluster crazies put
>
>        [global]
>                osd data = /var/lib/ceph/$type/$cluster-$id
>                osd journal = /var/lib/ceph/$type/$cluster-$id/journal
>                mon data = /var/lib/ceph/$type/$cluster-$id
>
>   (or whatever) in /etc/ceph/$cluster.conf and get something else.
>
> Code paths are identical, data flow is identical.  We get a simple general
> case, without closing the door on multicluster configurations, which vary
> only by the config value that is easily adjusted on a per-cluster basis...

Except we lost features. Now I can't iterate the contents of a
directory and know what they mean. I think we'll need that.

Basically, there are two ways of using Ceph. One is "do whatever you
want", the other one is "managed". Managed needs to really know what
goes where. If you want managed mode to always enforce configuration
to contain osd data, osd journal, mon data, just to be safe, and for
it to break the very first moment a sysadmin edits the config file, we
can do that. But there's no way the managed mode is going to be able
to be fully featured if file locations are completely based on the
configuration file. I'd rather not have too many "must be here" items
in the config file, to tempt the admin into editing it. Nothing will
ever prevent the "do whatever you want" side, but I'd like the
defaults to fit the "managed" mode.

Think of it this way: ceph.conf with arbitrary contents is a one way
mapping. I expect to need a two-way mapping.

Or, I can just make managed mode *always* pass in explicit --osd-data
etc. Then ceph.conf won't matter. I wouldn't describe that as simpler,
though.


* Re: defaults paths #2
  2012-04-06 17:34   ` Tommi Virtanen
@ 2012-04-06 17:55     ` Sage Weil
  2012-04-06 18:00       ` Tommi Virtanen
  2012-04-06 19:23       ` Bernard Grymonpon
  0 siblings, 2 replies; 14+ messages in thread
From: Sage Weil @ 2012-04-06 17:55 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel


On Fri, 6 Apr 2012, Tommi Virtanen wrote:
> On Thu, Apr 5, 2012 at 22:12, Sage Weil <sage@newdream.net> wrote:
> > Here's what I'm thinking:
> >
> >  - No data paths are hard-coded except for /etc/ceph/*.conf
> >  - We initially mount osd volumes in some temporary location (say,
> >   /var/lib/ceph/temp/$uuid)
> >  - We identify the oid, cluster uuid, etc., and determine where to mount
> >   it with
> >
> >        ceph-osd --cluster $cluster -i $id --show-config-value osd_data
> >
> >   This will normally give you the default, unless the conf file specified
> >   something else.
> >  - Normal people get a default of /var/lib/ceph/$type/$id
> >  - Multicluster crazies put
> >
> >        [global]
> >                osd data = /var/lib/ceph/$type/$cluster-$id
> >                osd journal = /var/lib/ceph/$type/$cluster-$id/journal
> >                mon data = /var/lib/ceph/$type/$cluster-$id
> >
> >   (or whatever) in /etc/ceph/$cluster.conf and get something else.
> >
> > Code paths are identical, data flow is identical.  We get a simple general
> > case, without closing the door on multicluster configurations, which vary
> > only by the config value that is easily adjusted on a per-cluster basis...
> 
> Except we lost features. Now I can't iterate the contents of a
> directory and know what they mean. I think we'll need that.

Unless you infer it from the conf value or some such kludge, but that 
would be fragile.  Okay.  I'm good with /var/lib/ceph/$type/$cluster-$id 
then.

Hopefully we can keep things as general as possible, so that brave souls 
can go out of bounds without getting bitten.  For example, never parse the 
directory name if the same information can be had from the directory 
contents.

Bernard, I suspect it would be pretty simple to make ceph-osd start up 
either via -i <id> or --uuid <uuid> which would enable a uuid-based scheme 
like you describe.  For these cookbooks, though, it'll be an <id>-based 
approach.

sage


* Re: defaults paths #2
  2012-04-06  7:37 ` Bernard Grymonpon
@ 2012-04-06 17:57   ` Tommi Virtanen
  2012-04-06 19:45     ` Bernard Grymonpon
  0 siblings, 1 reply; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-06 17:57 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Fri, Apr 6, 2012 at 00:37, Bernard Grymonpon <bernard@openminds.be> wrote:
> Storage lives by UUID's, I would suggest to move the programming in the direction to get rid of all the own naming and labeling, and just stick to uuids. I could not care less if the data on that disk is internally named "23" or "678", it is just a part of my cluster, and it is up to ceph to figure out which part of the puzzle it holds.

Ceph storage does not live by UUIDs. Each object is stored in specific
osds, which are identified by a sequential, dense, unique integer.
These integers exist, are allocated and freed, and are the very
*definition* of an osd.

UUIDs are useful for 1. ensuring that osd.42 from cluster A is not
confused with an osd.42 from cluster B (the cluster-wide fsid), and
2. ensuring we use the right journal for an osd (the per-osd uuid,
stored in both osd data and journal).

UUIDs are not human friendly, and are not a native identifier in Ceph.

Here's what running multiple osds looks like with the new upstart stuff:

ubuntu@inst01:~$ sudo initctl list|grep ceph
ceph-osd-all stop/waiting
ceph-mon-all stop/waiting
ceph-mon (ceph/single) start/running, process 2527
ceph-osd (ceph/0) start/running, process 2576
ceph-osd (ceph/1) start/running, process 2580
ceph-osd (ceph/2) start/running, process 2592
ceph-osd (ceph/3) start/running, process 3053

Doing the same with UUIDs would make that output have lines like

ceph-osd (ceph/1bbb1f49-c574-45b0-b4cc-43dbe7bacc0d) start/running, process 2576

and I really, really don't want that for no gain.

UUIDs make a lot of sense when you don't have central coordination of
identifiers. But Ceph *has* that. (It *needs* that, because we don't
do lookup tables.)

>> I think many people with that hardware setup will choose to just GPT
>> partition the SSD, and have a lot less code in the way of the IO.
> A single file, sitting on some partition somewhere, will be very hard to auto-find, unless you'll keep some info in OSD that tracks the UUID of the disk + the path inside the filesystem where it previously was.

For example: "All the journals that are just files are inside
/var/lib/ceph/journal, or are symlinked there." And that can even be a
configurable search path with multiple entries. Not that hard.
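
As a sketch, that directory could then look something like this (the
file names are purely illustrative):

    /var/lib/ceph/journal/osd.0.journal                   (plain file on this filesystem)
    /var/lib/ceph/journal/osd.1.journal -> /srv/ssd0/osd.1.journal   (symlinked from elsewhere)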

OSD data will not contain a path name to the journal; that is too transient.

>> I think we agree on intents, but I disagree strongly with the words
>> you chose. The sysadmin should not need to symlink anything. We will
>> bring up a whole cluster, and enable you to manage that easily.
>> Anything less is a failure. You may choose to opt out of some of that
>> higher smarts, if you want to do it differently, but we will need to
>> provide that. Managing a big cluster without automation is painful,
>> and people shouldn't need to reinvent the wheel.
>
> Actually, it will bring up one node of a cluster, regardless of the state of the other parts of the cluster. Question is how automated everything should be detected and started.
>
> Hotplugging and auto-mounting is there already - it is up to the sysadmin to use it if needed.

The automation I'm working on is intended to bring up and manage a
*whole cluster*, not a single machine.

Hotplugging is *not* there already. There is no turn-key
installs-in-under-60-minutes solution that gives you that.

> Same goes the other way - if someone like to start everything automagically, there is nothing that stops others from using the normal hotplug tools available, just write some rules to act on the presence of certain disklabels.

I am writing that logic, right now. For everyone to use. So that
everyone doesn't need to reinvent the wheel.

> Ceph can provide scripts to do this, but I would keep this logic out of the default behavior of ceph-osd. In my opinion, this is a storage daemon - not a storage-node-management-daemon.
>
> If you want to keep this outside ceph-osd, and provide this as some helper to the /etc/init.d scripts, then I'm all the way with you, as long as the magic .

It will all be optional, though you will be encouraged to use the automation.

> In my opinion the config file for ceph should contain only UUIDs for the devices which hold the data, like:
>
> ...
> [osd]
>  uuid=abc-123-123-123
>  uuid=def-456-456-456
>  uuid=ghi-789-789-789
> ...

We are talking about hundreds of machines and thousands of disks.
Nobody I've talked to wants to edit a config file to add/replace a
disk on a system that big; that's just not the way to go.

> If someone removes or pops in a disk, what should happen then is up to me. If it starts everything automatically, or if a certain node isn't set to auto-boot, is up to me (and should result in changes in the config in my chef-server, which controls all my nodes). If I want all these UUIDs to auto-start a OSD when inserted, then I would let chef write a rules file in udev to do exactly that.

It seems you have different enough opinions about how to manage your
systems that you might choose not to use the automation I've been
working on. That's fine, just do it. The core ceph daemons will not
assume things either way.

I would recommend you come back in a few months and see what comes out
of the work; you might find it more acceptable than you currently
think.


* Re: defaults paths #2
  2012-04-06 17:55     ` Sage Weil
@ 2012-04-06 18:00       ` Tommi Virtanen
  2012-04-06 19:23       ` Bernard Grymonpon
  1 sibling, 0 replies; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-06 18:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, Apr 6, 2012 at 10:55, Sage Weil <sage@newdream.net> wrote:
> Hopefully we can keep things as general as possible, so that brave souls
> can go out of bounds without getting bitten.  For example, never parse the
> directory name if the same information can be had from the directory
> contents.

I think one good rule is that we dictate what /var/lib/ceph/osd/*
means and how it behaves, and if you want to do custom things, use
e.g. /srv/example.com/osd/*, disable/usurp the automation where it
doesn't suit you, and use ceph.conf to adjust for the new paths.
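
For example, someone going out of bounds like that might put something
along these lines in ceph.conf (the path is only an example):

    [osd]
        osd data = /srv/example.com/osd/$cluster-$id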


* Re: defaults paths #2
  2012-04-06 17:55     ` Sage Weil
  2012-04-06 18:00       ` Tommi Virtanen
@ 2012-04-06 19:23       ` Bernard Grymonpon
  1 sibling, 0 replies; 14+ messages in thread
From: Bernard Grymonpon @ 2012-04-06 19:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: Tommi Virtanen, ceph-devel


On 06 Apr 2012, at 19:55, Sage Weil wrote:

> On Fri, 6 Apr 2012, Tommi Virtanen wrote:
>> On Thu, Apr 5, 2012 at 22:12, Sage Weil <sage@newdream.net> wrote:
>>> Here's what I'm thinking:
>>> 
>>>  - No data paths are hard-coded except for /etc/ceph/*.conf
>>>  - We initially mount osd volumes in some temporary location (say,
>>>   /var/lib/ceph/temp/$uuid)
>>>  - We identify the oid, cluster uuid, etc., and determine where to mount
>>>   it with
>>> 
>>>        ceph-osd --cluster $cluster -i $id --show-config-value osd_data
>>> 
>>>   This will normally give you the default, unless the conf file specified
>>>   something else.
>>>  - Normal people get a default of /var/lib/ceph/$type/$id
>>>  - Multicluster crazies put
>>> 
>>>        [global]
>>>                osd data = /var/lib/ceph/$type/$cluster-$id
>>>                osd journal = /var/lib/ceph/$type/$cluster-$id/journal
>>>                mon data = /var/lib/ceph/$type/$cluster-$id
>>> 
>>>   (or whatever) in /etc/ceph/$cluster.conf and get something else.
>>> 
>>> Code paths are identical, data flow is identical.  We get a simple general
>>> case, without closing the door on multicluster configurations, which vary
>>> only by the config value that is easily adjusted on a per-cluster basis...
>> 
>> Except we lost features. Now I can't iterate the contents of a
>> directory and know what they mean. I think we'll need that.
> 
> Unless you infer it from the conf value or some such kludge, but that 
> would be fragile.  Okay.  I'm good with /var/lib/ceph/$type/$cluster-$id 
> then.
> 
> Hopefully we can keep things as general as possible, so that brave souls 
> can go out of bounds without getting bitten.  For example, never parse the 
> directory name if the same information can be had from the directory 
> contents.
> 
> Bernard, I suspect it would be pretty simple to make ceph-osd start up 
> either via -i <id> or --uuid <uuid> which would enable a uuid-based scheme 
> like you describe.  For these cookbooks, though, it'll be an <id>-based 
> approach.

Sure, I can live with that - I'm just giving you my opinion. I'm happy that low-level tools and options will emerge to control very small parts and bits of ceph, which are needed to build solid and working deployment solutions.

If you provide "official" ceph cookbooks that don't work the way we like and manage things differently, then we'll build and maintain ours (like we already did) and take the best of both, in our opinion. It just bothers me that not only is the storage part controlled and laid out (which is normal, ceph is about the storage), but the way I manage my machines would also be forced to be the ceph way, and that is something I would not appreciate as a sysadmin.

But it seems I partially misunderstood Tommi - none of the changes would go into the daemons; they are just scripts that will be provided to help - and I very much welcome this. As long as ceph-osd is a simple storage daemon, and not a fancy scan-it-all-and-start-as-you-feel daemon, I'm a happy person.

Rgds,
Bernard

> 
> sage



* Re: defaults paths #2
  2012-04-06 17:57   ` Tommi Virtanen
@ 2012-04-06 19:45     ` Bernard Grymonpon
  2012-04-09 18:03       ` Tommi Virtanen
  0 siblings, 1 reply; 14+ messages in thread
From: Bernard Grymonpon @ 2012-04-06 19:45 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel


On 06 Apr 2012, at 19:57, Tommi Virtanen wrote:

> On Fri, Apr 6, 2012 at 00:37, Bernard Grymonpon <bernard@openminds.be> wrote:
>> Storage lives by UUID's, I would suggest to move the programming in the direction to get rid of all the own naming and labeling, and just stick to uuids. I could not care less if the data on that disk is internally named "23" or "678", it is just a part of my cluster, and it is up to ceph to figure out which part of the puzzle it holds.
> 
> Ceph storage does not live by UUIDs. Each object is stored in specific
> osds, which are identified by a sequential, dense, unique, integer.
> These integers exist, are allocated and freed, and are the very
> *definition* of an osd.
> 
> UUIDs are useful for ensuring 1. that osd.42 from cluster A is not
> confused with a osd.42 from cluster B (cluster-wide fsid)  2. for
> ensuring we use the right journal for an osd (per-osd uuid, stored in
> both osd data and journal).
> 
> UUIDs are not human friendly, and are not a native identifier in Ceph.
> 
> Here's what running multiple osds looks like with the new upstart stuff:
> 
> ubuntu@inst01:~$ sudo initctl list|grep ceph
> ceph-osd-all stop/waiting
> ceph-mon-all stop/waiting
> ceph-mon (ceph/single) start/running, process 2527
> ceph-osd (ceph/0) start/running, process 2576
> ceph-osd (ceph/1) start/running, process 2580
> ceph-osd (ceph/2) start/running, process 2592
> ceph-osd (ceph/3) start/running, process 3053
> 
> Doing the same with UUIDs would make that output have lines like
> 
> ceph-osd (ceph/1bbb1f49-c574-45b0-b4cc-43dbe7bacc0d) start/running, process 2576
> 
> and I really, really don't want for no gain.
> 
> UUIDs make a lot of sense when you don't have central coordination of
> identifiers. But Ceph *has* that. (It *needs* that, because we don't
> do lookup tables.)

Agreed, it looks nicer, but I would like to get some report about a UUID not being found, which I can then track down to a serial number of a drive/lun/partition/lvm part/... or some other thing. I don't know that ceph/536 is stored on device xyz without loading ceph magic and checking the actual content of the disk.

Let's go wild and say you have hundreds of machines, summing up to thousands of disks, all already migrated/moved to other machines/..., and it reports that OSD 536 is offline: how will you find which disk is failing/corrupt/... in which machine? Will you keep track of which OSD ran on which node last?

I think of my storage (in ceph) as a stash of hard disks, in random order. If my cluster is made up of 5, then I need 5 disks. 

I was thinking of adding a custom label next to the UUID of a partition/disk/... like "ceph.$clusterid.$id" (or even "ceph.$clusterid.$osd.$id"), which might help solve this problem. That way, you would know, without mounting the data, which data you're dealing with.

And I do understand that ceph needs the ids, and that they are a cornerstone of the inner workings, but as a sysadmin, that is not my problem. I do drives, each of them holding data. 

> 
>>> I think many people with that hardware setup will choose to just GPT
>>> partition the SSD, and have a lot less code in the way of the IO.
>> A single file, sitting on some partition somewhere, will be very hard to auto-find, unless you'll keep some info in OSD that tracks the UUID of the disk + the path inside the filesystem where it previously was.
> 
> For example: "All the journals that are just files are inside
> /var/lib/ceph/journal, or are symlinked there." And that can even be a
> configurable search path with multiple entries. Not that hard.
> 
> OSD data will not contain a path name to the journal, that is too transient.
> 
>>> I think we agree on intents, but I disagree strongly with the words
>>> you chose. The sysadmin should not need to symlink anything. We will
>>> bring up a whole cluster, and enable you to manage that easily.
>>> Anything less is a failure. You may choose to opt out of some of that
>>> higher smarts, if you want to do it differently, but we will need to
>>> provide that. Managing a big cluster without automation is painful,
>>> and people shouldn't need to reinvent the wheel.
>> 
>> Actually, it will bring up one node of a cluster, regardless of the state of the other parts of the cluster. Question is how automated everything should be detected and started.
>> 
>> Hotplugging and auto-mounting is there already - it is up to the sysadmin to use it if needed.
> 
> The automation I'm working is intended to bring up and manage a *whole
> cluster*. Not a single machine.
> 
> Hotplugging is *not* there already. There is no turn-key
> installs-in-under-60-minutes solution that gives you that.

Yet chef manages, on each run, exactly one node. Either chef will need to have some basic knowledge about the layout (like where the monitors are, and what the key is), or there needs to be magic on the nodes themselves to share this info between good and new nodes.

> 
>> Same goes the other way - if someone like to start everything automagically, there is nothing that stops others from using the normal hotplug tools available, just write some rules to act on the presence of certain disklabels.
> 
> I am writing that logic, right now. For everyone to use. So that
> everyone doesn't need to reinvent the wheel.
> 
>> Ceph can provide scripts to do this, but I would keep this logic out of the default behavior of ceph-osd. In my opinion, this is a storage daemon - not a storage-node-management-daemon.
>> 
>> If you want to keep this outside ceph-osd, and provide this as some helper to the /etc/init.d scripts, then I'm all the way with you, as long as the magic .
> 
> It will all be optional, though you will be encouraged to use the automation.

... and this is where I think I misunderstood you - I assumed this fancy voodoo would go in ceph-osd. 

> 
>> In my opinion the config file for ceph should contain only UUIDs for the devices which hold the data, like:
>> 
>> ...
>> [osd]
>>  uuid=abc-123-123-123
>>  uuid=def-456-456-456
>>  uuid=ghi-789-789-789
>> ...
> 
> We are talking about hundreds of machines and thousands of disks.
> Nobody I've talked to wants to edit a config file to add/replace a
> disk on a system that big, that's just not the way to go.

Editing a JSON-controlled file on a central provisioning server isn't really a daunting task. If you have a cluster of thousands of disks, I hope someone took the time to create a simple script to prepare and add a new drive to the cluster, or replace a failed drive. We would do this with specific recipes in chef. 

Each chef run would pick up the drives available to the system, check their uuid/label to see if they are known to belong to the cluster (should the mon know this? It knows there was an osd with id 456 that was part of the cluster...), and write out the configs and start the osds.

Kind regards,
Bernard

> 
>> If someone removes or pops in a disk, what should happen then is up to me. If it starts everything automatically, or if a certain node isn't set to auto-boot, is up to me (and should result in changes in the config in my chef-server, which controls all my nodes). If I want all these UUIDs to auto-start a OSD when inserted, then I would let chef write a rules file in udev to do exactly that.
> 
> It seems you have different enough opinions about how to manage your
> systems that you might choose to not to use the automation I've been
> working on. That's fine, just do it. The core ceph daemons will not
> assume things either way.
> 
> I would recommend you come back in a few months and see what come out
> of the work, you might find it more acceptable than you currently
> think.
> 



* Re: defaults paths #2
  2012-04-06 19:45     ` Bernard Grymonpon
@ 2012-04-09 18:03       ` Tommi Virtanen
  2012-04-09 18:16         ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-09 18:03 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Fri, Apr 6, 2012 at 12:45, Bernard Grymonpon <bernard@openminds.be> wrote:
> Lets go wild, and say, if you have hunderds of machines, summing up to thousands of of disks, all already migrated/moved to other machines/... , and it reports that OSD 536 is offline, how will you find what disk is failing/corrupt/... in which machine? Will you keep track which OSD ran on which node last?

That's a good question and I don't have a good enough answer for you
yet. Rest assured that's a valid concern.

It seems we're still approaching this from different angles. You want
to have an inventory of disks, known by uuid, and want to track where
they are, and plan their moves.

I want to know I have N servers with K hdd slots each, and I want each
one to be fully populated with healthy disks. I don't care what disk
is where, and I don't think it's realistic for me to maintain a manual
inventory. A failed disk means unplug that disk. An empty slot means
plug in a disk from the dedicated pile of spares. A chassis needing
maintenance is to be shut down, disks unplugged & plugged in
elsewhere. I don't care where. A lost disk needs to have its osd
deleted at some point (or just let them pile up; not a realistic
problem for a decade or so). Any inventory of disks is only realistic
from the discovery angle; just report what's plugged in right now.

I consider individual disks just about as uninteresting as power
supplies. Does that make sense?

Details pending..


* Re: defaults paths #2
  2012-04-09 18:03       ` Tommi Virtanen
@ 2012-04-09 18:16         ` Sage Weil
  2012-04-09 19:22           ` Tommi Virtanen
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2012-04-09 18:16 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Bernard Grymonpon, ceph-devel

On Mon, 9 Apr 2012, Tommi Virtanen wrote:
> On Fri, Apr 6, 2012 at 12:45, Bernard Grymonpon <bernard@openminds.be> wrote:
> > Lets go wild, and say, if you have hunderds of machines, summing up to thousands of of disks, all already migrated/moved to other machines/... , and it reports that OSD 536 is offline, how will you find what disk is failing/corrupt/... in which machine? Will you keep track which OSD ran on which node last?
> 
> That's a good question and I don't have a good enough answer for you
> yet. Rest assured that's a valid concern.
> 
> It seems we're still approaching this from different angles. You want
> to have an inventory of disks, known by uuid, and want to track where
> they are, and plan their moves.
> 
> I want to know I have N servers with K hdd slots each, and I want each
> one to be fully populated with healthy disks. I don't care what disk
> is where, and I don't think it's realistic for me to maintain a manual
> inventory. A failed disk means unplug that disk. An empty slot means
> plug in a disk from the dedicated pile of spares. A chassis needing
> maintenance is to be shut down, disks unplugged & plugged in
> elsewhere. I don't care where. A lost disk needs to have its osd
> deleted at some point (or just let them pile up; not a realistic
> problem for a decade or so). Any inventory of disks is only realistic
> from the discovery angle; just report what's plugged in right now.
> 
> I consider individual disks just about as uninteresting as power
> supplies. Does that make sense?

One thing we need to keep in mind here is that the individual disks are 
placed in the CRUSH hierarchy based on the host/rack/etc location in the 
datacenter.  Moving disks around arbitrarily will break the placement 
constraints if that position isn't also changed.

sage


* Re: defaults paths #2
  2012-04-09 18:16         ` Sage Weil
@ 2012-04-09 19:22           ` Tommi Virtanen
  2012-04-12 15:49             ` Bernard Grymonpon
  0 siblings, 1 reply; 14+ messages in thread
From: Tommi Virtanen @ 2012-04-09 19:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Bernard Grymonpon, ceph-devel

On Mon, Apr 9, 2012 at 11:16, Sage Weil <sage@newdream.net> wrote:
> One thing we need to keep in mind here is that the individual disks are
> placed in the CRUSH hierarchy based on the host/rack/etc location in the
> datacenter.  Moving disk around arbitrarily will break the placement
> constraints if that position isn't also changed.

Yeah, the location will have to be updated. I tend to think disks
*will* move, and it's better to cope with it than to think it won't
happen. All you need is a simple power supply/mobo/raid
controller/nic/etc failure; if there are any free slots anywhere, it's
probably better to plug the disks in there than to wait for a
replacement part. I'm working under the assumption that it's better to
"just bring them up" rather than having an extended osd outage or
claiming the osd as lost.

Updating the new location for the osd could be something we do even at
every osd start -- it's a nop if the location is the same as the old
one. And we can say the host knows where it is, and that information
is available in /etc or /var/lib/ceph.

I'll come back to this once it's a little bit more concrete; I'd
rather not make speculative changes until I can actually trigger the
behavior in a test bench.


* Re: defaults paths #2
  2012-04-09 19:22           ` Tommi Virtanen
@ 2012-04-12 15:49             ` Bernard Grymonpon
  2012-04-12 16:08               ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: Bernard Grymonpon @ 2012-04-12 15:49 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sage Weil, ceph-devel


On 09 Apr 2012, at 21:22, Tommi Virtanen wrote:

> On Mon, Apr 9, 2012 at 11:16, Sage Weil <sage@newdream.net> wrote:
>> One thing we need to keep in mind here is that the individual disks are
>> placed in the CRUSH hierarchy based on the host/rack/etc location in the
>> datacenter.  Moving disk around arbitrarily will break the placement
>> constraints if that position isn't also changed.
> 
> Yeah, the location will have to be updated. I tend to think disks
> *will* move, and it's better to cope with it than to think it won't
> happen. All you need is a simple power supply/mobo/raid
> controller/nic/etc failure, if there's any free slots anywhere it's
> probably better to plug the disks in there than waiting for a
> replacement part. I'm working under the assumption that it's better to
> "just bring them up" rather than having an extended osd outage or
> claiming the osd as lost.

I've updated my recipes to support disk moving now (and multi-mon clusters, btw), and have moved from

/var/lib/ceph/osd/$clustername-$id 

to

/var/lib/ceph/osd/$clustername-$uuid

It just isn't pretty to mount a disk in a temp place, check the "whoami" file, and then umount and remount everything on a certain ID. It is all automatically handled, and I think this feels okay.

The disks are detected by the label, which I made "$cluster.ceph". If such a label is detected, the disk is mounted, the whoami file is read, and the OSD is started with the correct parameters. If the whoami file is not present, the OSD is initialized and added to the mons...
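
In shell terms the flow is roughly the following (a simplified
approximation of what the chef recipe does, not the recipe itself):

    #!/bin/sh
    cluster=ceph
    # Find every device carrying the cluster's label and start its osd.
    for dev in $(blkid -o device -t LABEL="$cluster.ceph"); do
        uuid=$(blkid -o value -s UUID "$dev")
        dir="/var/lib/ceph/osd/$cluster-$uuid"
        mkdir -p "$dir" && mount "$dev" "$dir"
        if [ -f "$dir/whoami" ]; then
            ceph-osd --cluster "$cluster" -i "$(cat "$dir/whoami")"
        else
            : # fresh disk: initialize the osd here and register it with the mons
        fi
    done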

Input would be much appreciated. Using chef (one node at a time), and thus rather "building" a cluster than initializing a full cluster at once, makes the setup a bit strange sometimes (I don't know how the number of pgs is determined, or can be suggested, when creating a cluster).

> Updating the new location for the osd could be something we do even at
> every osd start -- it's a nop if the location is the same as the old
> one. And we can say the host knows where it is, and that information
> is available in /etc or /var/lib/ceph.

I also got to the point where I want to update the location of an OSD when bringing an OSD online.

Adding a new (bare) disk (and OSD) is easy: 

ceph osd crush add 3 osd.3 1 pool=default host=2 rack=1

(with host=2 and rack=1 coming from the node itself, somehow - it would be easy if we could use alphanumeric hostnames in those parameters...)

If there were a 

ceph osd crush update 3 osd.3 pool=default host=3 rack=2

command, that would solve the whole location problem.

Rgds,
Bernard

> 
> I'll come back to this once it's a little bit more concrete; I'd
> rather not make speculative changes, until I can actual trigger the
> behavior in a test bench.




* Re: defaults paths #2
  2012-04-12 15:49             ` Bernard Grymonpon
@ 2012-04-12 16:08               ` Sage Weil
  0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2012-04-12 16:08 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: Tommi Virtanen, ceph-devel

On Thu, 12 Apr 2012, Bernard Grymonpon wrote:
> 
> On 09 Apr 2012, at 21:22, Tommi Virtanen wrote:
> 
> > On Mon, Apr 9, 2012 at 11:16, Sage Weil <sage@newdream.net> wrote:
> >> One thing we need to keep in mind here is that the individual disks are
> >> placed in the CRUSH hierarchy based on the host/rack/etc location in the
> >> datacenter.  Moving disk around arbitrarily will break the placement
> >> constraints if that position isn't also changed.
> > 
> > Yeah, the location will have to be updated. I tend to think disks
> > *will* move, and it's better to cope with it than to think it won't
> > happen. All you need is a simple power supply/mobo/raid
> > controller/nic/etc failure, if there's any free slots anywhere it's
> > probably better to plug the disks in there than waiting for a
> > replacement part. I'm working under the assumption that it's better to
> > "just bring them up" rather than having an extended osd outage or
> > claiming the osd as lost.
> 
> I've updated my recipes to support disk moving now (and multi-mon clusters, btw), and have moved from
> 
> /var/lib/ceph/osd/$clustername-$id 
> 
> to
> 
> /var/lib/ceph/osd/$clustername-$uuid
> 
> It just isn't pretty to mount a disk in a temp place, check the "whoami" 
> file, and then umount and remount everything on a certain ID. It is all 
> automatically handled, and I think this feels okay.
> 
> The disks are detected by the label, which I made "$cluster.ceph". If 
> such a label is detected, the disk is mounted, the whoami file is read, 
> and the OSD is started with the correct parameters. If the whoami file 
> is not present, the OSD is initialized and added to the mons...
> 
> Input would be much appreciated - both using chef (one node at a time), 
> and rather "building" a cluster instead of initializing a full cluster 
> at once, makes the setup a bit strange sometimes (I don't know how the 
> amount of pg's is determined, or can be suggested, when creating a 
> cluster).

It will eventually be possible to adjust pg_num manually, or have it 
auto-scale to the size of the cluster or pool.  That's a couple of 
versions out still, but coming up soon.

> > Updating the new location for the osd could be something we do even at
> > every osd start -- it's a nop if the location is the same as the old
> > one. And we can say the host knows where it is, and that information
> > is available in /etc or /var/lib/ceph.
> 
> I also got to the point where I want to update the location of an OSD when bringing a OSD online.
> 
> Adding a new (bare) disk (and OSD) is easy: 
> 
> ceph osd crush add 3 osd.3 1 pool=default host=2 rack=1
> 
> (with host=2 and rack=1 coming from the node itself, somehow - it would be easy if we could use alfanumerical hostnames in those parameters...)
> 
> If there would be a 
> 
> ceph osd crush update 3 osd.3 pool=default host=3 rack=2
> 
> command, that would solve the whole location problem.

Added this to the tracker, #2268.

Thanks!
sage


> 
> Rgds,
> Bernard
> 
> > 
> > I'll come back to this once it's a little bit more concrete; I'd
> > rather not make speculative changes, until I can actual trigger the
> > behavior in a test bench.
> 
> 
> 



Thread overview: 14+ messages
2012-04-05 23:36 defaults paths #2 Tommi Virtanen
2012-04-06  5:12 ` Sage Weil
2012-04-06 17:34   ` Tommi Virtanen
2012-04-06 17:55     ` Sage Weil
2012-04-06 18:00       ` Tommi Virtanen
2012-04-06 19:23       ` Bernard Grymonpon
2012-04-06  7:37 ` Bernard Grymonpon
2012-04-06 17:57   ` Tommi Virtanen
2012-04-06 19:45     ` Bernard Grymonpon
2012-04-09 18:03       ` Tommi Virtanen
2012-04-09 18:16         ` Sage Weil
2012-04-09 19:22           ` Tommi Virtanen
2012-04-12 15:49             ` Bernard Grymonpon
2012-04-12 16:08               ` Sage Weil
